Preface



This volume contains the revised versions of selected papers presented during 
the 32nd Annual Conference of the German Classification Society (Gesellschaft 
für Klassifikation - GfKl). The GfKl 2008 conference, which was organized in
cooperation with the British Classification Society (BCS) and the Dutch/Flemish 
Classification Society (VOC), was hosted by Helmut-Schmidt-University, Hamburg, 
Germany, in July 2008. The focus of the conference was Data Analysis, Data 
Handling, and Business Intelligence. The conference featured 13 invited lectures 
(3 plenary speeches and 10 semi-plenary lectures), 166 contributed talks, 4 invited 
sessions, and 2 workshops. With 275 participants from 22 countries in Europe and 
overseas this GfKl Conference, once again, provided an international forum for
discussions and mutual exchange of knowledge with colleagues from different fields of
interest. From 95 full papers that had been submitted for this volume 71 submissions 
were finally accepted. 

The scientific program included a broad range of topics. Interdisciplinary research 
and the interaction between theory and practice were particularly emphasized. The 
following sections (with chairs in alphabetical order) were established: 

Theory and Methods: 

Exploratory Data Analysis (R. Wehrens); Clustering and Classification (H.-H. Bock 
and M. Vichi); Optimization in Statistics (G. Ritter); Pattern Recognition and 
Machine Learning (P.J.F. Groenen and E. Hüllermeier); Visualization and Scaling
Methods (C. Hennig and M. van de Velden); Mixture Analysis (A. Montanari and 
W. Seidel); Bayesian, Neural, and Fuzzy Clustering (R. Kruse and H.A. Le Thi); 
Computational Intelligence and Metaheuristics (A. Fink); Evaluation of Clustering 
Algorithms and Data Structures (F. Leisch).

Application Fields: 

Subject Indexing and Library Science (H.-J. Hermes and B.W.J. Lorenz); Marketing 
and Management Science (R. Decker and D. van den Poel); Collective Intelligence
(A. Geyer-Schulz); Text Mining (W. Gaul and L. Schmidt-Thieme); Banking
and Finance (H. Locarek-Junge); Market Research, Controlling, and OR (D. Baier,
Y. Boztug, and W. Steiner); Biostatistics and Bioinformatics (A. Benner and
A.-J. Boulesteix); Genome and DNA Analysis (H.-P. Klenk); Medical and Health 
Sciences (B. Lausen); Archaeology (T. Kerig and I. Herzog); Processes in Industry 
(F. Joos); Spatial Planning (M. Behnisch); Linguistics (H. Goebl and P. Grzybek); 
Statistical Musicology (C. Weihs); Education and Psychology (S. Krolak-Schwerdt). 

Invited sessions were organized by colleagues from associated societies, namely 
the British Classification Society (BCS, C. Hennig and F. Murtagh) and the Dutch/
Flemish Classification Society (VOC, M. van de Velden and R. Wehrens). 
Additionally, two invited sessions were organized on the topics PLS Path Modeling
(V. Esposito Vinzi) and Microarrays in Clinical Research (B. Lausen and
A. Ultsch). Furthermore, there was a pre-conference workshop on “Data Quality:
Defining, Measuring and Improving” (H.-J. Lenz) and a workshop on “Libraries
and Decimal Classification” (H.-J. Hermes).

The editors would like to thank the section chairs for doing a great job in
organizing their sections. Cordial thanks for the paper reviews go to the section chairs,
the additional members of the scientific program committee (V. Esposito Vinzi,
W. Esswein, C. Fantapié Altobelli, H. Hebbel, K. Jajuga, A. Okada, D. Steuer,
U. Tüshaus, and I. van Mechelen) as well as the additional reviewers (W. Adler,
T. Augustin, L.B. Barsotti, M. Behnisch, A. Brenning, D.G. Calo, G.P. Celeux,
B. Garel, E. Gassiat, E. Godehardt, L. Häberle, E. Haimerl, U. Henker, H. Holzmann,
T. Kerig, R. Klar, S. Lessmann, H. Lukashevich, S. Matos, F.R. McMorris, F. Meyer,
F. Mörchen, H.-J. Mucha, O. Opitz, V. Patilea, S. Potapov, S. Santana, T. Scharl,
M. Schwaiger, K. Sever, K. Sommer, R. Stecking, C. Strobl, N.X. Thinh, and
M. Trzesiok).

Based on the reviews and a further in-depth examination, two papers were
granted the best paper award: Bernard Haasdonk and Elżbieta Pękalska:
Classification with Kernel Mahalanobis Distance Classifiers; Wiebke Petersen: Linear
Coding of Non-linear Hierarchies - Revitalization of an Ancient Classification
Method.

The great success of the GfKl 2008 conference would not have been possible
without the support of the members of the local organization committee
(C. Fantapié Altobelli, A. Fink, H. Hebbel, W. Seidel, D. Steuer, and U. Tüshaus)
and many other people mainly working backstage. As representatives of the whole
team, we would like to particularly thank Y. Köllner and M. Schäfer for their
exceptional efforts and great commitment with respect to the preparation and
organization of the conference. The GfKl Conference 2008 was facilitated by
the financial and/or material support of the following institutions and companies
(in alphabetical order): Deutsche Forschungsgemeinschaft, Freunde und Förderer of
Helmut-Schmidt-Universität Hamburg, Gesellschaft für Klassifikation, Gesellschaft
für Konsumforschung, Hamburg-Mannheimer Versicherungen, Hamburger
Sparkasse, Helmut-Schmidt-Universität Hamburg, StatSoft GmbH, Vattenfall, and
Volksfürsorge Deutsche Lebensversicherung AG. We express our gratitude to all of
them. We would also like to thank M. Bihn from Springer-Verlag, Heidelberg, for
her support and dedication to the production of this volume.

In closing we wish to thank all the authors who contributed to this book. Without 
their work and passion, this book would not have been possible. 



Hamburg and Marburg (Germany), 
and Colchester (UK), 

March 2009 



Andreas Fink 
Berthold Lausen 
Wilfried Seidel 
Alfred Ultsch 
Semi-supervised Probabilistic Distance
Clustering and the Uncertainty of Classification



Cem Iyigun and Adi Ben-Israel



Abstract Semi-supervised clustering is an attempt to reconcile clustering
(unsupervised learning) and classification (supervised learning, using prior
information on the data). These two modes of data analysis are combined in a parameterized
model; the parameter $\theta \in [0, 1]$ is the weight attributed to the prior information,
with $\theta = 0$ corresponding to clustering, and $\theta = 1$ to classification. The results
(cluster centers, classification rule) depend on the parameter $\theta$; an insensitivity
to $\theta$ indicates that the prior information is in agreement with the intrinsic cluster
structure, and is otherwise redundant. This explains why some data sets (such as
the Wisconsin breast cancer data, Merz and Murphy, UCI repository of machine
learning databases, University of California, Irvine, CA) give good results for all
reasonable classification methods. The uncertainty of classification is represented
here by the geometric mean of the membership probabilities, shown to be an
entropic distance related to the Kullback-Leibler divergence.

Keywords Breast cancer data · Classification · Classification uncertainty ·
Clustering · Contour approximation of data · Diabetes data · Entropy · Home range ·
Kullback-Leibler divergence · Probabilistic clustering · Semi-supervised learning



1 Introduction 
1.1 Clustering 

A cluster is a set of elements that are similar in some sense. In what follows, data
points are considered as points in a metric space with a distance function $d$, and
“similar” is taken as “close”: the data points $x, y$ are similar if $d(x, y)$ is small.
Clustering, the process of identifying clusters with dissimilar elements in different
clusters, is modeled here as an optimization problem

$$\min_{p,c} F(p, c), \tag{P.0}$$



with an objective function F, and variables p (probabilities) and c (centers) that are 
explained below. 

1.2 Classification 

Here a population, or a data set, $\mathcal{D}$ is partitioned into several disjoint classes, but the
class to which an element belongs may be unknown, and needs to be determined.
Data points are of the form $(x, y)$, where $x$ is a vector of observations, and the label
$y$ is the (usually unknown) class where $x$ belongs. A classification rule is a function
that assigns class values $y$ to observations $x$ from $\mathcal{D}$. It is learned from sample data
for which the class labels are known (the prior information). Classification is the
process of deriving such a classification rule.

A common protocol is to learn the classification rule from a randomly selected
subset $\mathcal{T}$ of $\mathcal{D}$ (the training set), then test it on the remaining data
$\mathcal{D} \setminus \mathcal{T}$ (the testing set), recording the percentage of correct
classifications as a performance criterion of the rule.
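The protocol above can be sketched in a few lines. This is only an illustration: the function names, the 70% training fraction, the toy threshold learner, and the made-up data are assumptions, not part of the paper.

```python
import random

def holdout_accuracy(data, learn_rule, train_fraction=0.7, seed=0):
    """Learn a rule on a random training subset and score it on the remaining data."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    training, testing = shuffled[:cut], shuffled[cut:]
    rule = learn_rule(training)
    correct = sum(1 for x, y in testing if rule(x) == y)
    return 100.0 * correct / len(testing)   # percentage of correct classifications

def learn_threshold_rule(training):
    """Hypothetical toy learner for scalar x: threshold halfway between class means."""
    class0 = [x for x, y in training if y == 0]
    class1 = [x for x, y in training if y == 1]
    threshold = 0.5 * (sum(class0) / len(class0) + sum(class1) / len(class1))
    return lambda x: 1 if x > threshold else 0

# Made-up, well-separated data: class 0 near 0, class 1 near 2.
data = [(0.1 * i, 0) for i in range(10)] + [(2.0 + 0.1 * i, 1) for i in range(10)]
accuracy = holdout_accuracy(data, learn_threshold_rule)
```

Any learner that maps a training list to a rule can be passed in place of the toy threshold learner.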

We model classification as an optimization problem

$$\min_{p,c} G(p, c), \tag{P.1}$$

where the prior information on the data is incorporated in the objective function $G$.

Example 1. Medical diagnostics. Here $\mathcal{D}$ is a medical data set with data points
$(x, y)$, $x$ the vector of test results for a given patient (abbreviated the patient), and
$y$ the medical status, say $y = 1$ if disease is present, $y = 0$ otherwise (two classes,
regardless of the intrinsic structure of the data). A classification rule $\eta(\cdot)$ may result
in error, say $\eta(x) = 0$ for a patient $x$ with disease (false negative), or $\eta(x) = 1$ for
a patient $x$ that is disease free (false positive). In general, these two classification
errors have different consequences, and one may want to reduce one at the cost of
increasing the other.

A well known medical data set, the Wisconsin breast cancer data set (Merz &
Murphy, 1996), has attracted much research, see, e.g., Mangasarian, Setiono, and
Wolberg (1999), Wolberg and Mangasarian (1990). Interestingly, all classification
methods tried on this data set gave good results, see, e.g., Lim, Loh, and Shih (2000).
This is explained in Example 3 below.

1.3 Learning

Classification is also called supervised learning to indicate that prior information
is available. In contrast, cluster analysis relies only on the intrinsic structure and
geometry of the data, and is called unsupervised learning.

Another difference is that the number of classes is given in classification
problems, while in clustering the “right” number of clusters for the data in question may
not be given, and has to be determined.

1.4 Semi-supervised Clustering 

Semi-supervised clustering is an attempt to reconcile clustering and classification,
two contrasting modes of data analysis. The semi-supervised clustering problem is
modeled here as a parametric family of optimization problems, using a parameter
$\theta \in [0, 1]$ that expresses the weight attributed to the prior information,

$$\min_{p,c} \{(1 - \theta)\, F(p, c) + \theta\, G(p, c)\}. \tag{P.$\theta$}$$

The clustering problem (P.0) and the classification problem (P.1) are special cases,
for $\theta = 0$ and $\theta = 1$, respectively.

For other approaches to semi-supervised clustering see, e.g., Chapelle, Schölkopf,
and Zien (2006), Grira, Crucianu, and Boujemaa (2005) and Jain, Murty, and Flynn
(1999).



1.5 Matching Labels 

The optimal solutions of (P.$\theta$) depend on $\theta$, and are denoted $\{p^*(\theta), c^*(\theta)\}$. A data
set with prior information is said to have well-matching labels if $\{p^*(\theta), c^*(\theta)\}$
are insensitive to $\theta$. In this case, the prior information is in agreement with the
intrinsic structure of the data set, and the clusters of the problem (P.0) may be used
to derive the classification rule for the problem (P.1). In the opposite case, where
$\{p^*(\theta), c^*(\theta)\}$ are sensitive to $\theta$, the labels are said to be ill-matching.



1.6 Plan of This Paper 



Probabilistic distance clustering and classification (using prior information) are
outlined in Sects. 2 and 3, respectively. Semi-supervised clustering is introduced in
Sect. 4 as a parametric family of convex combinations of the clustering and
classification problems, with a parameter $\theta$ indicating the importance placed on the
prior information. The algorithm proposed in Sect. 4.4 updates the cluster centers as
convex combinations of the data points. Section 5 illustrates the dependence of the
results on the parameter $\theta$, for a synthetic example with ill-matching labels,
Example 2, and two medical data sets, Example 3. The cluster probabilities are studied in
Appendix 1 as functions of distances, justifying the probabilistic model of Sect. 2.3.
The classification uncertainty function of Sect. 2.5 is shown in Appendix 2 to be an
entropic distance, associated with the Kullback-Leibler relative entropy.
2 Probabilistic Distance Clustering

2.1 Notation

Let $\overline{1,n}$ denote the index set $\{1, 2, \ldots, n\}$, and $\forall$ the qualifier “for all”.

Consider a data set $\mathcal{D}$ with $N$ points $x_i$, $i \in \overline{1,N}$, and $K$ clusters,

$$\mathcal{D} = \mathcal{C}_1 \cup \mathcal{C}_2 \cup \cdots \cup \mathcal{C}_K, \qquad \mathcal{C}_k \cap \mathcal{C}_\ell = \emptyset \ \text{ if } k \neq \ell.$$

The data points have $n$ components (attributes), and are formally considered as
elements of an $n$-dimensional real space $\mathbb{R}^n$ (although the vector sum of two data
points is not necessarily a data point).

The $k$th cluster $\mathcal{C}_k$ has a center $c_k$ (to be computed), and a distance function
$d_k(\cdot, c_k)$, in particular the elliptic distance

$$d_k(x, c_k) := \langle (x - c_k),\, Q_k (x - c_k) \rangle^{1/2}, \tag{1}$$

where $\langle x, y \rangle$ is the standard inner product of vectors $x, y \in \mathbb{R}^n$, and the geometry of
the cluster is modeled by the positive definite matrix $Q_k$. We often use the
Mahalanobis distance,

$$d_k(x, c_k) := \langle (x - c_k),\, \Sigma_k^{-1} (x - c_k) \rangle^{1/2}, \tag{2}$$

where $\Sigma_k$ is the covariance matrix of $\mathcal{C}_k$, see, e.g., Bar-Hillel, Hertz, Shental, and
Weinshall (2005), Xing, Ng, Jordan, and Russell (2003).

The space $\mathbb{R}^n$ is thus endowed with $K$ metrics. Both centers and distance
functions are updated by the clustering algorithm, see Sect. 2.8 below. We abbreviate
$d_k(x, c_k)$ by $d_k(x)$.
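As a small illustration of the elliptic distance (1), the following sketch evaluates the quadratic form with plain Python lists. The function name and the two example matrices are assumptions chosen for illustration; the Mahalanobis distance (2) corresponds to passing the inverse covariance matrix as `Q`.

```python
import math

def elliptic_distance(x, c, Q):
    """Distance (1): sqrt(<x - c, Q (x - c)>) for a positive definite Q (nested lists)."""
    diff = [xi - ci for xi, ci in zip(x, c)]
    Q_diff = [sum(q * dj for q, dj in zip(row, diff)) for row in Q]
    return math.sqrt(sum(di * qi for di, qi in zip(diff, Q_diff)))

identity = [[1.0, 0.0], [0.0, 1.0]]     # Q = I gives the Euclidean distance
stretched = [[4.0, 0.0], [0.0, 1.0]]    # hypothetical Q that penalizes the first coordinate

d_euclid = elliptic_distance([3.0, 4.0], [0.0, 0.0], identity)
d_ellip = elliptic_distance([3.0, 4.0], [0.0, 0.0], stretched)
```

With the identity matrix the function reduces to the usual Euclidean distance; a different `Q` per cluster gives each cluster its own metric, as described above.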



2.2 Probabilistic Clustering 

In probabilistic (fuzzy or soft) clustering the assignment of points to clusters is
not deterministic, and is given as a probability, see, e.g., Bezdek (1981), Höppner,
Klawonn, Kruse, and Runkler (1999). Let $p_k(x)$ denote the probability that the
point $x$ belongs to the cluster $\mathcal{C}_k$. This notation allows for deterministic
membership, expressed by $p_k(x) = 1$. The function $p_k(\cdot)$ is also called the membership
function of $\mathcal{C}_k$.



2.3 Probabilistic Distance Clustering

In probabilistic distance clustering the membership probabilities $\{p_k(x) : k \in \overline{1,K}\}$
depend on the distances $d_k(x)$ to the clusters. A reasonable assumption is

    membership in a cluster is more likely the closer it is,    (A)

see Appendix 1 for details. A simple way to model this assumption is

$$p_k(x)\, d_k(x) = D(x), \quad \forall\, k \in \overline{1,K}, \tag{3}$$

for any $x$, where the function $D(\cdot)$ is independent of the cluster.

There are other ways to model Assumption (A), but the simple model (3) works
well enough for our purposes.



2.4 Probabilities and the Joint Distance Function 



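From (3), and the fact that the probabilities $\{p_k(x)\}$ add to one, it follows that

$$p_k(x) = \frac{\prod_{j \neq k} d_j(x)}{\sum_{i=1}^{K} \prod_{j \neq i} d_j(x)}, \quad k \in \overline{1,K}, \quad \text{and} \quad D(x) = \frac{\prod_{j=1}^{K} d_j(x)}{\sum_{i=1}^{K} \prod_{j \neq i} d_j(x)}. \tag{4}$$

In particular, for $K = 2$,

$$p_1(x) = \frac{d_2(x)}{d_1(x) + d_2(x)}, \quad p_2(x) = \frac{d_1(x)}{d_1(x) + d_2(x)}, \quad \text{and} \quad D(x) = \frac{d_1(x)\, d_2(x)}{d_1(x) + d_2(x)}. \tag{5}$$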



The function $D(x)$, called the joint distance function (abbreviated JDF) at $x$, is
(up to a constant) the harmonic mean of the distances $\{d_1(x), \ldots, d_K(x)\}$. The JDF
is a continuous function that captures the data points in its lower level sets, a
property called contour approximation, see Arav (2008), Iyigun and Ben-Israel (2009).
Indeed, the geometry of each cluster is represented by its distance function (2), and
the overall shape of the data set is given by the harmonic mean of these distances.
The contour approximation of data by the JDF is illustrated in Fig. 1, for data sets
with two and three clusters.
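A minimal sketch of formulas (4), assuming the distances of one point to the $K$ centers are already available as a list; the function names and the sample distances are illustrative only.

```python
import math

def membership_probabilities(d):
    """Probabilities (4): p_k is proportional to the product of the other distances."""
    K = len(d)
    prods = [math.prod(d[j] for j in range(K) if j != k) for k in range(K)]
    total = sum(prods)
    return [pk / total for pk in prods]

def joint_distance(d):
    """JDF (4): product of all distances over the sum of the leave-one-out products."""
    K = len(d)
    total = sum(math.prod(d[j] for j in range(K) if j != k) for k in range(K))
    return math.prod(d) / total

d = [1.0, 2.0, 4.0]                 # made-up distances to three centers
p = membership_probabilities(d)     # the closest cluster gets the largest probability
D = joint_distance(d)
products = [pk * dk for pk, dk in zip(p, d)]   # each product should equal D, as in (3)
```

Writing (4) with leave-one-out products, rather than reciprocals, keeps the formulas usable when one distance is zero.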

The JDF also gives a compact representation of the data in question: to represent
a data set with $N$ data points in $\mathbb{R}^n$, arranged in $K$ clusters, the JDF requires $K$
centers and $K$ covariance matrices, a total of $K\,n(n+3)/2$ parameters (counting each
symmetric covariance matrix as $n(n+1)/2$ parameters), a considerable saving if $K \ll N$.

An ecological forerunner of the JDF and contour approximation is the home
range, the territory of a species, given in Dixon and Chapman (1980) in terms of
the harmonic mean of area moments, a finding confirmed since then for hundreds of
species.

Fig. 1 The lower level sets of the JDF capture the data points: (a) a data set with 2 clusters; (b) a data set with 3 clusters



2.5 The Classification Uncertainty Function



The JDF has the dimension of distance. Normalizing it, we get the dimensionless
function

$$E(x) = K\, D(x) \Big/ \Big( \prod_{j=1}^{K} d_j(x) \Big)^{1/K}, \tag{6}$$

with 0/0 interpreted as zero. $E(x)$ is the harmonic mean of the distances $\{d_j(x)\}$
divided by their geometric mean. It follows that $0 \le E(x) \le 1$, with $E(x) = 0$
if any $d_j(x) = 0$, i.e., if $x$ is a cluster center, and $E(x) = 1$ if and only if the
probabilities $p_j(x)$ are all equal.

$E(x)$ can be written, using (4), as the geometric mean of the probabilities (up to
a constant),

$$E(x) = K \Big( \prod_{j=1}^{K} p_j(x) \Big)^{1/K}. \tag{7}$$

In particular, for $K = 2$,

$$E(x) = \frac{2 \sqrt{d_1(x)\, d_2(x)}}{d_1(x) + d_2(x)} = 2 \sqrt{p_1(x)\, p_2(x)}. \tag{8}$$

In the case $K = 1$, where the whole data set is taken as one cluster, we get formally
from (7),

$$E(x) = 1. \tag{9}$$



The function $E(x)$ represents the uncertainty of classifying the point $x$, see
Appendix 2. We call $E(x)$ the classification uncertainty function, abbreviated
CUF, at $x$.

The CUF of the data set $\mathcal{D} = \{x_i : i \in \overline{1,N}\}$ is defined as

$$E(\mathcal{D}) := \frac{1}{N} \sum_{i=1}^{N} E(x_i). \tag{10}$$

$E(\mathcal{D})$ is a monotone decreasing function of $K$, the number of clusters, decreasing
from $E(\mathcal{D}) = 1$ [for $K = 1$, see (9)], to $E(\mathcal{D}) = 0$ (for $K = N$, the trivial
case where every data point is a separate cluster). The rate of decrease of $E(\mathcal{D})$ is a
natural criterion for determining the “right” number of clusters, if it is not given.
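The CUF (7) and its data-set average (10) are easy to evaluate once the membership probabilities are known. The sketch below uses hypothetical function names and made-up probability vectors; it also computes the Kullback-Leibler quantity that Appendix 2 relates to $-\log E(x)$.

```python
import math

def classification_uncertainty(p):
    """CUF (7): K times the geometric mean of the membership probabilities."""
    K = len(p)
    return K * math.prod(p) ** (1.0 / K)

def dataset_uncertainty(prob_vectors):
    """CUF of a data set (10): the average of E(x_i) over the points."""
    return sum(classification_uncertainty(p) for p in prob_vectors) / len(prob_vectors)

def kl_uniform_to(p):
    """Kullback-Leibler divergence between (1/K, ..., 1/K) and p (all p_k > 0)."""
    K = len(p)
    return sum((1.0 / K) * math.log((1.0 / K) / pk) for pk in p)

E_equal = classification_uncertainty([0.5, 0.5])    # equal probabilities: maximal uncertainty
E_skewed = classification_uncertainty([0.9, 0.1])   # 2 * sqrt(0.09), as in (8)
E_center = classification_uncertainty([1.0, 0.0])   # x at a cluster center: no uncertainty
E_data = dataset_uncertainty([[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]])
```

For strictly positive probabilities, `-math.log(classification_uncertainty(p))` and `kl_uniform_to(p)` agree up to rounding, which is the identity derived in Appendix 2.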



2.6 An Extremum Problem for the Cluster Probabilities 
at a Point 

Given the distances $\{d_k(x)\}$, and considering the probabilities $\{p_k\}$ as variables
(abbreviating $p_k(x)$ by $p_k$), we note that (3) is the optimality condition of the
extremum problem

$$\text{Minimize } \sum_{k=1}^{K} d_k(x)\, p_k^2 \tag{11}$$

$$\text{subject to } \sum_{k=1}^{K} p_k = 1, \ \text{ and } \ p_k \ge 0, \ k \in \overline{1,K}.$$

Indeed, the Lagrangian of this problem is

$$L(p_1, \ldots, p_K, \lambda) = \sum_{k=1}^{K} d_k(x)\, p_k^2 - \lambda \Big( \sum_{k=1}^{K} p_k - 1 \Big),$$

and zeroing the partial derivatives (with respect to $p_k$) gives $2\, p_k\, d_k(x) = \lambda$ for
every $k$, so the product $p_k\, d_k(x)$ does not depend on $k$, which is (3).

The squares of probabilities in (11) serve to smooth the underlying optimization
problem, which is nonsmooth; see the seminal paper Teboulle (2007) for other
smoothing schemes, and a modern optimization framework for clustering.
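A quick numerical sanity check of this optimality claim, written as a sketch with made-up distances: the probabilities proportional to $1/d_k$ are compared against many random points of the probability simplex. The helper names and the sampling scheme are assumptions for illustration.

```python
import random

def objective(d, p):
    """Objective of (11): sum of d_k * p_k^2."""
    return sum(dk * pk * pk for dk, pk in zip(d, p))

def closed_form(d):
    """Probabilities satisfying (3): p_k proportional to 1/d_k (all d_k > 0)."""
    inv = [1.0 / dk for dk in d]
    s = sum(inv)
    return [v / s for v in inv]

def random_simplex_point(K, rng):
    """A random feasible point: nonnegative weights normalized to sum to one."""
    w = [rng.random() + 1e-9 for _ in range(K)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(0)
d = [1.0, 2.0, 4.0]                       # made-up distances
p_star = closed_form(d)
best = objective(d, p_star)               # equals the JDF value D(x) for these distances
never_beaten = all(
    objective(d, random_simplex_point(len(d), rng)) >= best - 1e-12
    for _ in range(10000)
)
```

For these distances the minimal value of (11) coincides with $D(x) = 4/7$, since $\sum_k d_k p_k^2 = D(x) \sum_k p_k$ when (3) holds.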



2.7 An Extremum Problem for Clustering the Data Set

The optimization problem (P.0) of Sect. 1.1, for clustering a data set
$\mathcal{D} = \{x_i : i \in \overline{1,N}\}$ into $K$ clusters, is written in detail as

$$\text{Minimize } \sum_{i=1}^{N} \sum_{k=1}^{K} d_k(x_i, c_k)\, p_k(x_i)^2 \tag{P.0}$$

$$\text{subject to } \sum_{k=1}^{K} p_k(x_i) = 1, \quad p_k(x_i) \ge 0, \quad \forall\, i \in \overline{1,N}, \ k \in \overline{1,K},$$

with the probabilities $p$ and centers $c$ as variables.



2.8 An Outline of the Probabilistic Distance Clustering
Algorithm of Ben-Israel and Iyigun (2008)

The algorithm of Ben-Israel and Iyigun (2008) solves the above problem (P.0) by
iteratively updating the probabilities and centers.

Given the data set $\mathcal{D}$ and the number $K$ of clusters, the algorithm begins with $K$
arbitrary centers $c_k$. In each iteration, the probabilities are computed by (4), for the
given centers and distances. Fixing these probabilities, the centers $c_k$ are updated as
convex combinations of the data points $x_i$,

$$c_k = \sum_{i=1}^{N} \lambda_{ki}\, x_i, \quad \text{with weights} \quad \lambda_{ki} = \frac{p_k(x_i)^2 / d_k(x_i)}{\sum_{j=1}^{N} p_k(x_j)^2 / d_k(x_j)}, \quad k \in \overline{1,K}. \tag{12}$$

The update (12) is obtained by differentiating the objective function in (P.0), and
zeroing the gradient.

The iterations stop when the centers “stop moving”.

If the Mahalanobis distance is used, the covariance matrices are taken initially as
$\Sigma_k = I$, and are recomputed using the current centers $c_k$ and probabilities.

This algorithm of Ben-Israel and Iyigun (2008) was adapted to account for the
cluster sizes in Iyigun and Ben-Israel (2008).

Notes 

(a) Iteration (12) is a generalization to several centers of the Weiszfeld method
(Weiszfeld, 1937) for solving the Fermat-Weber location problem, and can be
used for solving multi-facility location problems.

(b) A theoretical issue is that the gradient of the objective function of (P.0) is
undefined if one of the data points $\{x_i\}$ coincides with one of the current centers $\{c_k\}$.
In this case the gradient can be modified, as in Kuhn (1967) and Kuhn (1973),
to guarantee that the method converges for all but a denumerable set of initial
centers.

(c) The algorithm of Ben-Israel and Iyigun (2008) is robust: cluster centers are
insensitive to outliers, which are discounted because the weights in (12) are
inversely proportional to the distances.
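The iteration outlined in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes Euclidean distances (no covariance update), floors distances at a small constant instead of modifying the gradient as in note (b), and uses made-up data and hypothetical names.

```python
import math

def pdc(points, centers, tol=1e-9, max_iter=500):
    """Sketch of probabilistic distance clustering with Euclidean distances (K >= 2)."""
    K, dim = len(centers), len(points[0])
    for _ in range(max_iter):
        # distances d_k(x_i), floored to avoid division by zero at a center
        dist = [[max(math.dist(x, c), 1e-12) for c in centers] for x in points]
        new_centers = []
        for k in range(K):
            weights = []
            for d in dist:
                inv = [1.0 / dj for dj in d]
                p_k = inv[k] / sum(inv)            # probability (4), written with reciprocals
                weights.append(p_k * p_k / d[k])   # unnormalized weight of (12)
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * x[t] for w, x in zip(weights, points)) / total
                for t in range(dim)))
        shift = sum(math.dist(a, b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:                            # the centers "stop moving"
            break
    return centers

# Made-up data: two unit squares of points, far apart.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
centers = pdc(points, [(2.0, 0.0), (8.0, 1.0)])
```

On this toy data the two centers settle close to the middles of the two squares; the far-away points receive small weights, in line with note (c).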
3 Prior Information and Classification
3.1 Probabilistic Labels

Let $\mathcal{D}$ be a data set with $N$ points $\{x_i : i \in \overline{1,N}\}$ and $K$ clusters
$\{\mathcal{C}_k : k \in \overline{1,K}\}$.

We assume that prior information is given for each point $x_i$, as probabilities
$r_k(x_i)$ that $x_i$ belongs to $\mathcal{C}_k$, $k \in \overline{1,K}$. These allow for rigid constraints such as
$r_2(x_3) = 1 = r_2(x_4)$, saying that $x_3$ and $x_4$ both belong to $\mathcal{C}_2$.
If the story ends here, the membership probabilities $p_k(x_i)$ are taken equal to the
probabilistic labels,

$$p_k(x_i) = r_k(x_i), \quad \forall\, k, i. \tag{13}$$



3.2 An Extremum Problem for Classification 

A (trivial) extremum problem resulting in (13) is

$$\text{Minimize } \sum_{i=1}^{N} \sum_{k=1}^{K} d_k(x_i, c_k)\, \big( p_k(x_i) - r_k(x_i) \big)^2 \tag{P.1}$$

$$\text{subject to } \sum_{k=1}^{K} p_k(x_i) = 1, \quad p_k(x_i) \ge 0, \quad \forall\, i \in \overline{1,N}, \ k \in \overline{1,K},$$

which is taken as the problem (P.1) of Sect. 1.2. The distances $d_k(x_i, c_k)$ in the
objective function serve to give it a dimension of distance, which allows combining
(P.1) and (P.0).



4 Semi-supervised Distance Clustering 

4.1 An Extremum Problem for Semi-supervised Clustering 



We propose combining the clustering and classification problems in a parametric
model, using a parameter $\theta \in [0, 1]$ for the weight given to the prior information.
The model uses an optimization problem that is a convex combination of (P.0) and
(P.1),

$$\text{Minimize } \sum_{i=1}^{N} \sum_{k=1}^{K} d_k(x_i, c_k) \Big[ (1 - \theta)\, p_k(x_i)^2 + \theta\, \big( p_k(x_i) - r_k(x_i) \big)^2 \Big] \tag{P.$\theta$}$$

$$\text{subject to } \sum_{k=1}^{K} p_k(x_i) = 1, \quad p_k(x_i) \ge 0, \quad \forall\, i \in \overline{1,N}, \ k \in \overline{1,K}.$$

This formulation gives a continuum of problems, with the clustering problem (P.0)
and the classification problem (P.1) as special cases.

For fixed centers $\{c_k\}$ and distances $\{d_k(x_i, c_k)\}$, the problem (P.$\theta$) is separable,
reducing to $N$ problems, one for each point $x_i$, $i \in \overline{1,N}$.



4.2 Probabilities 



To simplify notation, consider the case of two clusters, and a single data point $x$
(since the problem (P.$\theta$) is separable). The distances $d_k(x)$, probabilities $p_k(x)$ and
labels $r_k(x)$ are abbreviated below by $d_k$, $p_k$ and $r_k$ respectively.

Given $d_1, d_2$ and $r_1, r_2$, the problem (P.$\theta$) becomes

$$\min \ \Big\{ (1 - \theta) \big( d_1 p_1^2 + d_2 p_2^2 \big) + \theta \big( d_1 (p_1 - r_1)^2 + d_2 (p_2 - r_2)^2 \big) \Big\} \tag{14}$$

$$\text{s.t. } \ p_1 + p_2 = 1, \quad p_1, p_2 \ge 0.$$

The Lagrangian of this problem is

$$L(p_1, p_2, \lambda) = (1 - \theta) \big( d_1 p_1^2 + d_2 p_2^2 \big) + \theta \big( d_1 (p_1 - r_1)^2 + d_2 (p_2 - r_2)^2 \big) - \lambda (p_1 + p_2 - 1).$$

Zeroing the gradient (with respect to $p_1, p_2$), and using the fact that the probabilities
add to one, we get

$$p_1 = (1 - \theta)\, \frac{d_2}{d_1 + d_2} + \theta\, r_1, \qquad p_2 = (1 - \theta)\, \frac{d_1}{d_1 + d_2} + \theta\, r_2, \tag{15}$$

giving the probabilities as convex combinations of the clustering probabilities (5)
and the labels $r_1, r_2$.
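Formula (15) translates directly into code. The sketch below assumes two clusters, labels with $r_1 + r_2 = 1$, and made-up distances; the function name is illustrative.

```python
def semi_supervised_probabilities(d1, d2, r1, r2, theta):
    """Probabilities (15) for two clusters; assumes r1 + r2 == 1 and d1 + d2 > 0."""
    p1 = (1.0 - theta) * d2 / (d1 + d2) + theta * r1
    p2 = (1.0 - theta) * d1 / (d1 + d2) + theta * r2
    return p1, p2

# A point that is closer to cluster 1 (d1 < d2) but is labeled as cluster 2.
p_clustering = semi_supervised_probabilities(1.0, 3.0, 0.0, 1.0, theta=0.0)
p_mixed = semi_supervised_probabilities(1.0, 3.0, 0.0, 1.0, theta=0.25)
p_labels = semi_supervised_probabilities(1.0, 3.0, 0.0, 1.0, theta=1.0)
```

As $\theta$ grows from 0 to 1, the probabilities move linearly from the clustering probabilities (5) to the labels, which is the behavior the examples of Sect. 5 explore.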



4.3 Cluster Centers 

Given a data set $\mathcal{D} = \{x_i : i \in \overline{1,N}\}$, identified for simplicity with the training set,
and fixing the probabilities [$p_1(x_i)$, $p_2(x_i)$ as in (15)], the extremal problem (P.$\theta$) is

$$\min_{c_1, c_2} \ \Big\{ (1 - \theta) \sum_{i=1}^{N} \Big( d_1(x_i, c_1)\, p_1(x_i)^2 + d_2(x_i, c_2)\, p_2(x_i)^2 \Big) + \theta \sum_{i=1}^{N} \Big( d_1(x_i, c_1) \big( p_1(x_i) - r_1(x_i) \big)^2 + d_2(x_i, c_2) \big( p_2(x_i) - r_2(x_i) \big)^2 \Big) \Big\} \tag{16}$$

For the elliptic distance (1), the gradient of the objective function in (16) w.r.t. $c_1$ is

$$-(1 - \theta) \sum_{i=1}^{N} \frac{Q_1 (x_i - c_1)}{d_1(x_i, c_1)}\, p_1(x_i)^2 \ - \ \theta \sum_{i=1}^{N} \frac{Q_1 (x_i - c_1)}{d_1(x_i, c_1)}\, \big( p_1(x_i) - r_1(x_i) \big)^2 .$$

Zeroing the gradient, and canceling the nonsingular matrix $Q_1$, we can express the
center $c_1$ as a convex combination of the data points $\{x_i : i \in \overline{1,N}\}$. Repeating for
the center $c_2$, we can summarize

$$c_k = \sum_{i=1}^{N} \lambda_{ki}\, x_i, \quad k = 1, 2, \tag{17}$$

where the weights $\lambda_{ki}$ are given by

$$\lambda_{ki} = \frac{u_k(x_i)}{\sum_{j=1}^{N} u_k(x_j)}, \quad \text{with} \quad u_k(x_i) = (1 - \theta)\, \frac{p_k(x_i)^2}{d_k(x_i, c_k)} + \theta\, \frac{\big( p_k(x_i) - r_k(x_i) \big)^2}{d_k(x_i, c_k)}, \quad k = 1, 2. \tag{18}$$

The coefficients $u_k(x_i)$ in (18) depend on the parameter $\theta$. The limits of the
coefficient $u_1(x_i)$ in the extreme cases $\theta = 0$ and $\theta = 1$ are




Analogous results apply to the coefficient $u_2(x_i)$.
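One way to work out these limits, derived here from (15) and (18) rather than quoted from the authors, is sketched below. It assumes $r_1(x_i) + r_2(x_i) = 1$ and writes $\tilde p_1$ (a symbol introduced only for this sketch) for the clustering probability (5).

```latex
% Assumes r_1 + r_2 = 1; \tilde p_1 = d_2 / (d_1 + d_2) is the clustering probability (5).
\begin{align*}
\theta = 0:\quad
  & p_1 = \tilde p_1 = \frac{d_2}{d_1 + d_2},
  \qquad u_1(x_i) = \frac{\tilde p_1(x_i)^2}{d_1(x_i, c_1)}, \\
\theta \to 1:\quad
  & p_1 - r_1 = (1 - \theta)\,(\tilde p_1 - r_1), \\
  & u_1(x_i) = (1 - \theta)\,
    \frac{p_1(x_i)^2 + \theta (1 - \theta)\,\bigl(\tilde p_1(x_i) - r_1(x_i)\bigr)^2}{d_1(x_i, c_1)}
    \;\longrightarrow\; 0, \\
  & \frac{u_1(x_i)}{1 - \theta} \;\longrightarrow\; \frac{r_1(x_i)^2}{d_1(x_i, c_1)} .
\end{align*}
```

Under this reading, the common factor $(1 - \theta)$ cancels in the ratio (18), so the weights $\lambda_{1i}$ stay well defined as $\theta \to 1$ whenever some label $r_1(x_j)$ is nonzero.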



4.4 Algorithm 

The above ideas are implemented in an algorithm for semi-supervised distance
clustering of data. A schematic description, presented - for simplicity - for the case of
two clusters, follows.

Algorithm 1 Semi-supervised distance clustering

Initialization: given data $\mathcal{D}$, any two points $c_1, c_2$, covariances $\Sigma_1 = \Sigma_2 = I$,
                a value $\theta$, and $\epsilon > 0$
Iteration:
    Step 1  compute distances $d_1(x)$, $d_2(x)$ for all $x \in \mathcal{D}$
    Step 2  compute probabilities $p_1(x)$, $p_2(x)$, using (15) for all $x \in \mathcal{D}$
    Step 3  update the centers $c_1^+$, $c_2^+$, using (17)-(18)
    Step 4  compute the cluster covariances $\Sigma_1$, $\Sigma_2$
            using the current centers and probabilities
    Step 5  if $\|c_1^+ - c_1\| + \|c_2^+ - c_2\| < \epsilon$ stop
            return to step 1

The algorithm solves the problem (P.$\theta$) and reduces for $\theta = 0$ to the probabilistic
distance clustering algorithm of Ben-Israel and Iyigun (2008). Step 4 is needed if
Mahalanobis distances are used, and is absent otherwise.
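Algorithm 1 can be sketched as follows for Euclidean distances, in which case Step 4 is absent. This is an illustrative sketch under stated assumptions, not the authors' code: the function name, the distance floor, the guard for vanishing coefficients, and the toy data are all hypothetical, and labels are assumed to satisfy $r_1 + r_2 = 1$.

```python
import math

def semi_supervised_pdc(points, labels, centers, theta, eps=1e-9, max_iter=500):
    """Sketch of Algorithm 1 for two clusters with Euclidean distances (no Step 4).

    labels[i] = (r1, r2) with r1 + r2 == 1 is the prior information for points[i].
    """
    dim = len(points[0])
    for _ in range(max_iter):
        # Step 1: distances, floored to avoid division by zero at a center
        dist = [[max(math.dist(x, c), 1e-12) for c in centers] for x in points]
        # Step 2: probabilities (15)
        probs = []
        for (d1, d2), (r1, r2) in zip(dist, labels):
            p1 = (1.0 - theta) * d2 / (d1 + d2) + theta * r1
            probs.append((p1, 1.0 - p1))
        # Step 3: centers (17) with coefficients (18)
        new_centers = []
        for k in range(2):
            u = [((1.0 - theta) * p[k] ** 2 + theta * (p[k] - r[k]) ** 2) / d[k]
                 for p, r, d in zip(probs, labels, dist)]
            total = sum(u)
            if total == 0.0:           # all coefficients vanish (theta = 1): keep the center
                new_centers.append(tuple(centers[k]))
                continue
            new_centers.append(tuple(
                sum(w * x[t] for w, x in zip(u, points)) / total for t in range(dim)))
        # Step 5: stopping rule
        shift = sum(math.dist(a, b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, probs

# Made-up data with labels that agree with the two intrinsic groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
labels = [(1.0, 0.0)] * 4 + [(0.0, 1.0)] * 4
centers, probs = semi_supervised_pdc(points, labels, [(2.0, 0.0), (8.0, 1.0)], theta=0.5)
```

Running the function for several values of `theta` and comparing the returned centers is one way to probe whether a labeled data set is well-matching in the sense of Sect. 1.5.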



5 Examples 

Recall that a data set has well-matching labels if the results of clustering are
insensitive to the parameter $\theta$, and ill-matching labels otherwise. We illustrate this for a
synthetic data set, Example 2, with ill-matching labels, and two medical data sets in
Example 3.

Example 2. Figure 2a shows a data set $\mathcal{D}$ in $\mathbb{R}^2$ with $N = 200$ data points in two
equal clusters. The labels of these points are shown in different colors in Fig. 2b.
These labels are clearly in conflict with the intrinsic clusters.

Fig. 2 Illustration of Example 2: (a) the data points; (b) the data points and their labels

Fig. 3 The CUF of Example 2 for different $\theta$ values: (a)-(c) contours of $E(x)$ for $\theta = 0$, $0.25$, and $1.00$; (d) $E(\mathcal{D})$ as a function of $\theta$



Figures 3a-c show level sets of the CUF $E(x)$ of (8) for different values of $\theta$.
These were computed using results obtained by Algorithm 1. Darker colors indicate
higher values of $E(x)$, and greater uncertainty in classification.

For $\theta = 0$, see Fig. 3a, the labels are ignored, and the data set is partitioned
following its intrinsic clusters as in Fig. 2a. The vertical white line in the center is
the locus of equal probabilities $p_1(x) = p_2(x) = 0.5$. This line, which coincides here
with the Fisher linear discriminant, can serve as a classification rule for assigning
points to the two intrinsic clusters.

For $\theta = 0.25$, see Fig. 3b, the level sets of $E(x)$ have evolved to take account of
the prior information. The locus of equal probabilities $p_1(x) = p_2(x)$, which can
serve as a classification rule, is again shown in white. Figure 3c shows the level
sets of $E(x)$ for $\theta = 1$, i.e., where only the prior information is considered.¹ The
equiprobability locus is here the horizontal white line, contrasting with the vertical
line in Fig. 3a.

¹ The level sets shown have low values of $E(x)$, reflecting no uncertainty of classification, and the
colors would all be white or near white if the color scale was the same as in the previous figures.

Figure 3d displays $E(\mathcal{D})$, the CUF of the data set $\mathcal{D} = \{x_i : i \in \overline{1,N}\}$, see (10),
for different values of $\theta$. For $\theta = 1$ the uncertainty is zero, since the probabilities
are given by the binary labels, see (15). That $E(\mathcal{D})$ does not decrease monotonically
as $\theta$ increases is due to the conflict between the intrinsic clusters in Fig. 2a, and
the prior information in Fig. 2b. A mixture of these two models may have greater
uncertainty than the “pure” models (P.0) and (P.1).

Fig. 4 Examples of well-matching, and ill-matching, labels: (a) Wisconsin breast cancer data; (b) diabetes data

Example 3. We consider two well-known data sets, given in Merz and Murphy
(1996). For each data set, the cluster centers and classification rules were computed
for different values of $\theta$, and the percentages of correct classifications are plotted
in Figs. 4a, b (the thick curves). The thin curves are the graphs of the CUF $E(\mathcal{D})$,
which decreases to zero as $\theta \to 1$.

Figure 4a concerns the Wisconsin breast cancer data set, shown to have
well-matching labels. (The percentage of correct classifications is insensitive to $\theta$.) This
set would be clustered correctly even without the prior information, which is needed
only to put the right labels on the clusters. This explains why all 33 methods reported
in Lim et al. (2000) give excellent results for this set. The CUF $E(\mathcal{D})$ is monotone
decreasing since there is no conflict between the intrinsic clusters and the labels.

Figure 4b illustrates the diabetes data set, Merz and Murphy (1996), shown to
have ill-matching labels. The percentages of correct classifications are sensitive to
the parameter $\theta$, and the CUF $E(\mathcal{D})$ is non-monotonic.
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Appendix 1: The Membership Probabilities 

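In this appendix, $d_k(\mathbf{x})$ stands for $d_k(\mathbf{x}, \mathbf{c}_k)$, the distance of $\mathbf{x}$ to the center $\mathbf{c}_k$ of the
$k$th cluster, $k \in \overline{1,K}$.

The cluster membership probabilities $\{p_k(\mathbf{x}) : k \in \overline{1,K}\}$ of a point $\mathbf{x}$ depend
only on the distances $\{d_k(\mathbf{x}) : k \in \overline{1,K}\}$,

$$\mathbf{p}(\mathbf{x}) = \mathbf{f}(\mathbf{d}(\mathbf{x})), \tag{19}$$

where $\mathbf{p}(\mathbf{x}) \in \mathbb{R}^K$ is the vector of probabilities $(p_k(\mathbf{x}))$, and $\mathbf{d}(\mathbf{x})$ is the vector of
distances $(d_k(\mathbf{x}))$. Natural assumptions for the relation (19) include

$$d_i(\mathbf{x}) < d_j(\mathbf{x}) \implies p_i(\mathbf{x}) > p_j(\mathbf{x}), \quad \text{for all } i, j \in \overline{1,K}, \tag{20a}$$

$$\mathbf{f}(\lambda\, \mathbf{d}(\mathbf{x})) = \mathbf{f}(\mathbf{d}(\mathbf{x})), \quad \text{for any } \lambda > 0, \tag{20b}$$

$$Q\, \mathbf{p}(\mathbf{x}) = \mathbf{f}(Q\, \mathbf{d}(\mathbf{x})), \quad \text{for any permutation matrix } Q. \tag{20c}$$

Condition (20a) states that membership in a cluster is more probable the closer it is,
which is Assumption (A) of Sect. 2.3. The meaning of (20b) is that the probabilities
$p_k(\mathbf{x})$ do not depend on the scale of measurement, i.e., the function $\mathbf{f}$ is homogeneous
of degree 0. It follows that the probabilities $p_k(\mathbf{x})$ depend only on the ratios
of the distances $\{d_k(\mathbf{x}) : k \in \overline{1,K}\}$.

The symmetry of $\mathbf{f}$, expressed by (20c), guarantees for each $k \in \overline{1,K}$ that the
probability $p_k(\mathbf{x})$ does not depend on the numbering of the other clusters.

Assuming continuity of $\mathbf{f}$, it follows from (20a) that

$$d_i(\mathbf{x}) = d_j(\mathbf{x}) \implies p_i(\mathbf{x}) = p_j(\mathbf{x}),$$

for any $i, j \in \overline{1,K}$. In particular, the probabilities $p_k(\mathbf{x})$ are all equal only if so are
the distances $d_k(\mathbf{x})$.

For any nonempty subset $S \subset \overline{1,K}$, let

$$p_S(\mathbf{x}) = \sum_{s \in S} p_s(\mathbf{x}),$$

the probability that $\mathbf{x}$ belongs to one of the clusters $\{C_s : s \in S\}$, and let $p_k(\mathbf{x}|S)$
denote the conditional probability that $\mathbf{x}$ belongs to the cluster $C_k$, given that it
belongs to one of the clusters $\{C_s : s \in S\}$.

Since the probabilities $p_k(\mathbf{x})$ depend only on the ratios of the distances $\{d_k(\mathbf{x}) :
k \in \overline{1,K}\}$, and these ratios are unchanged in subsets $S$ of the index set $\overline{1,K}$, it follows
that for all $k \in \overline{1,K}$, $\emptyset \neq S \subset \overline{1,K}$,

$$p_k(\mathbf{x}) = p_k(\mathbf{x}|S)\, p_S(\mathbf{x}), \tag{21}$$

which is the choice axiom of Luce (1959, Axiom 1), and therefore, Yellott (2001),

$$p_k(\mathbf{x}|S) = \frac{v_k(\mathbf{x})}{\sum_{s \in S} v_s(\mathbf{x})}, \tag{22}$$

where $v_k(\mathbf{x})$ is a scale function, in particular,

$$p_k(\mathbf{x}) = \frac{v_k(\mathbf{x})}{\sum_{j \in \overline{1,K}} v_j(\mathbf{x})}. \tag{23}$$

Assuming $v_k(\mathbf{x}) \neq 0$ for all $k$, it follows that

$$p_k(\mathbf{x})\, v_k(\mathbf{x})^{-1} = \frac{1}{\sum_{j \in \overline{1,K}} v_j(\mathbf{x})}, \tag{24}$$

where the right hand side is a function of $\mathbf{x}$, and does not depend on $k$.

Property (20a) implies that the function $v_k(\cdot)$ is monotone decreasing. A simple
choice is

$$v_k(\mathbf{x}) = \frac{1}{d_k(\mathbf{x})}, \tag{25}$$

for which (24) gives

$$p_k(\mathbf{x})\, d_k(\mathbf{x}) = \frac{1}{\sum_{j \in \overline{1,K}} 1/d_j(\mathbf{x})} = D(\mathbf{x}), \tag{26}$$

in agreement with (3)-(4).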
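The consequences of the simple choice (25) can be checked numerically. The following is a minimal sketch, assuming strictly positive distances; the function names and the three illustrative distances are hypothetical and not part of the original text. It computes the probabilities from (23) with $v_k = 1/d_k$, and then verifies that the products $p_k d_k$ share a common value, as in (26), and that rescaling all distances leaves the probabilities unchanged, as required by (20b).

```python
def membership_probabilities(distances):
    """Return p_k = (1/d_k) / sum_j (1/d_j), i.e. (23) with the choice v_k = 1/d_k."""
    if any(d <= 0 for d in distances):
        raise ValueError("all distances must be strictly positive")
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


def joint_distance(distances):
    """Return D(x) = 1 / sum_j (1/d_j), the common value of p_k(x) * d_k(x)."""
    return 1.0 / sum(1.0 / d for d in distances)


# Illustrative (hypothetical) distances of one point to K = 3 cluster centers.
d = [1.0, 2.0, 4.0]
p = membership_probabilities(d)
D = joint_distance(d)

# Each product p_k * d_k equals D, independently of k.
products = [pk * dk for pk, dk in zip(p, d)]

# Multiplying every distance by the same positive factor does not change p.
p_scaled = membership_probabilities([10.0 * dk for dk in d])

print("p        =", p)
print("D        =", D)
print("p_k d_k  =", products)
print("p scaled =", p_scaled)
```

The closest center receives the largest probability, in line with (20a).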



Appendix 2: The Classification Uncertainty Function 



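Let $\mathcal{P}^K$ be the set of $K$-dimensional probability vectors, denoted $\mathbf{p} = (p_i)$, $\mathbf{q} =
(q_i)$. Given a convex function $\varphi : \mathbb{R}_+ \to \mathbb{R}$, the Csiszár $\varphi$-divergence, Csiszár
(1978), defined by

$$I_\varphi(\mathbf{p}, \mathbf{q}) := \sum_{i=1}^{K} q_i\, \varphi\!\left(\frac{p_i}{q_i}\right), \quad \text{with } \varphi(1) = 0, \tag{27}$$

is a distance function on $\mathcal{P}^K$, a generalized measure of entropy, Aczel (1984),
whose distance-like properties make it useful in stochastic optimization (Ben-Tal
& Teboulle, 1987; Ben-Tal, Ben-Israel, & Teboulle, 1991). For the special case

$$\varphi_{KL}(t) := t \log t, \quad t > 0,$$

(27) gives

$$I_{\varphi_{KL}}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{K} p_i \log \frac{p_i}{q_i},$$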

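the Kullback-Leibler distance (Kullback, 1959; Kullback & Leibler, 1951). Rewriting
the CUF (7) as

$$E(\mathbf{x}) = \left( \prod_{i \in \overline{1,K}} \frac{p_i(\mathbf{x})}{1/K} \right)^{1/K}$$

and taking logarithms, we get

$$-\log E(\mathbf{x}) = \sum_{i \in \overline{1,K}} (1/K) \log \frac{1/K}{p_i(\mathbf{x})} = I_{\varphi_{KL}}\!\left(\tfrac{1}{K}\mathbf{1},\, \mathbf{p}(\mathbf{x})\right), \tag{28}$$

the Kullback-Leibler distance between the distributions

$$\mathbf{p}(\mathbf{x}) = (p_1(\mathbf{x}), p_2(\mathbf{x}), \ldots, p_K(\mathbf{x})) \quad \text{and} \quad \tfrac{1}{K}\mathbf{1} = \left(\tfrac{1}{K}, \ldots, \tfrac{1}{K}\right).$$

The latter distribution, $\tfrac{1}{K}\mathbf{1}$, is of maximal uncertainty in $\mathcal{P}^K$, and consequently the
divergence $I_{\varphi_{KL}}(\tfrac{1}{K}\mathbf{1}, \mathbf{p}(\mathbf{x}))$ is a measure of the uncertainty of the distribution $\mathbf{p}(\mathbf{x})$,
with smaller values corresponding to greater uncertainty.

Writing (28) as

$$E(\mathbf{x}) = \exp\left\{ -I_{\varphi_{KL}}\!\left(\tfrac{1}{K}\mathbf{1},\, \mathbf{p}(\mathbf{x})\right) \right\} \tag{29}$$

it follows that $E(\mathbf{x})$ is an entropic measure of the uncertainty of classification, a
monotone increasing function of the uncertainty.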
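The identity between the CUF and the Kullback-Leibler distance to the uniform distribution can be verified on a small example. The sketch below assumes that the CUF is the normalized geometric mean $E = K (\prod_i p_i)^{1/K}$, which is the form suggested by the rewritten expression preceding (28); the function names and the probability vectors are illustrative assumptions only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler distance sum_i p_i log(p_i / q_i) for strictly positive vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def cuf(p):
    """Assumed form of the CUF: E = K * (prod_i p_i)^(1/K), computed via logarithms."""
    K = len(p)
    return K * math.exp(sum(math.log(pi) for pi in p) / K)


K = 3
uniform = [1.0 / K] * K            # the distribution of maximal uncertainty
p = [4.0 / 7.0, 2.0 / 7.0, 1.0 / 7.0]   # hypothetical membership probabilities
peaked = [0.98, 0.01, 0.01]        # hypothetical, nearly certain classification

# E(x) agrees with exp{-I(uniform, p)}, as stated in (29).
lhs = cuf(p)
rhs = math.exp(-kl_divergence(uniform, p))

print("E(p)              =", lhs)
print("exp(-KL(1/K 1,p)) =", rhs)
print("E(uniform)        =", cuf(uniform))
print("E(peaked)         =", cuf(peaked))
```

Under this assumption the CUF attains its largest value at the uniform distribution and decreases as the membership probabilities become more concentrated on one cluster.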

Strategies of Model Construction
for the Analysis of Judgment Data



Sabine Krolak-Schwerdt 



Abstract This paper is concerned with the types of models researchers use to 
analyze empirical data in the domain of social judgments and decisions. Models 
for the analysis of judgment data may be divided into two classes depending on
the criteria they optimize: models that optimize an internal (mathematical) criterion
function, with the aim of minimizing the discrepancy between the values predicted by
the model and the obtained data, and models that incorporate a substantive underlying
theory, so that model parameters are not only formally defined, but represent specified
components of judgments. Results from applying models from both classes to empirical data
exhibit considerable differences between the models in construct validity, but not in 
empirical validity. It may be concluded that any model for the analysis of judgment 
data implies the selection of a formal theory about judgments. Hence, optimizing a 
mathematical criterion function does not induce a non-theoretical rationale or neu- 
tral tool. As a consequence, models satisfying construct validity seem superior in 
the domain of judgments and decisions. 

Keywords Model comparison • Models of data analysis • Social judgments •
Validity.



1 Introduction 

This paper is concerned with the types of models researchers develop and use to 
analyze empirical data in the domain of social judgments and decisions. 

Social judgments play a central role in professional as well as private everyday
life. Social judgments are a key prerequisite for coordinated social life, and
the ability to integrate complex social information for judgment purposes is one
of the most demanding tasks (Fiske & Taylor, 2008). Frequently, these judgments
contribute to far-reaching decisions about persons. Examples are medical expert
judgments, court decisions or decisions about job applicants.

The combination of characteristics or attributes of the object to be judged into
a composite, which represents the judgment, is a pervasive and important problem
in nearly all kinds of decision-making situations (Einhorn & Hogarth, 1975). For
example, it may be considered how symptoms are to be weighted and combined
into a clinical judgment about the disease of a person. These problems subsume
the following main aspects of the combination issue: (1) specifying the function
that relates the attributes to the composite and (2) determining the weight of each
attribute in the composite.

A great number of investigations have approached the measurement of judgments 
and possible biases or flaws in judgments. Typically, these are faced with the follow- 
ing problem. On the one hand, there are substantive cognitive theories and empirical 
results on the nature of judgments. On the other hand, the choice of a method to 
analyze judgment data implies the selection of a formal theory about judgments. 
Hence, the method has the same function of model building as the cognitive theory. 
Both have to coincide, otherwise artefacts are obtained instead of valid results. The 
term “method” refers to a statistical method to analyze the data or a model of data 
analysis. The question may be raised how an adequate method may be constructed. 



2 Strategies of Model Construction 

Corresponding methods may be divided into at least two classes depending on the
criteria they optimize (Apostel, 1961; Roskam, 1979). The first optimizes a
mathematical criterion function. The aim is to minimize the discrepancy between the
values predicted by the model and the obtained data. Frequently, least squares procedures
are used. An example is multiple regression with the model equation
$y_i = a + \sum_j \beta_j x_{ij} + e_i$. The model parameters of interest are the regression
weights $\beta_j$, which are estimated to fulfil the least squares criterion
$\sum_i (y_i - a - \sum_j \beta_j x_{ij})^2 := \min$.
The criterion which has to be satisfied by a valid model of the obtained data is empirical
validity. Thus, empirical validity requires that the model fit the obtained
data, and it is usually assessed by an overall goodness-of-fit measure. A measure that
is frequently used is the correlation $R$ between the obtained data and the predicted
data, or its square $R^2$, which specifies the proportion of variation in the data predicted
by the model (cf. Harshman, 1984).
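The first strategy can be illustrated with a small numerical sketch. The code below assumes a tiny made-up data set with two predictors and solves the normal equations directly; the helper names and all data values are hypothetical. It estimates the intercept and the regression weights by least squares and then computes the correlation $R$ between obtained and predicted values as the goodness-of-fit measure.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(xs, y):
    """Return [a, beta_1, ..., beta_J] minimizing sum_i (y_i - a - sum_j beta_j x_ij)^2."""
    design = [[1.0] + list(row) for row in xs]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def correlation(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


# Hypothetical data: two predictors, six observations, small fixed disturbances.
xs = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
noise = [0.2, -0.1, 0.1, -0.3, 0.2, -0.1]
y = [2.0 + 1.0 * x1 + 0.5 * x2 + e for (x1, x2), e in zip(xs, noise)]

coef = least_squares(xs, y)
pred = [coef[0] + coef[1] * x1 + coef[2] * x2 for x1, x2 in xs]
residuals = [yi - pi for yi, pi in zip(y, pred)]
R = correlation(y, pred)

print("a, beta_1, beta_2 =", coef)
print("R =", R, " R^2 =", R ** 2)
```

At the least squares solution the residuals are orthogonal to the intercept column and to each predictor, which is what the criterion above amounts to.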

In the second approach, the criterion which has to be satisfied is construct validity.
Construct validity refers to the ability of the model parameters to reflect the
judgment structures that they are specified to represent. The corresponding type
of model construction involves incorporating a substantive theory into the
model. An example from psychophysics concerns the modelling of taste impressions.
According to the psychophysical law $k = C^n t$, a specific taste $k$ depends on
the concentration $C$ of a tasted liquid and the duration $t$ of exposure. The exponent $n$ is
different for tastes like sweet, bitter or sour. Thus, the model parameter $n$ is not just
formal, but represents a specific taste. In other words, it has construct validity. Item
response theory from the domain of psychological assessment is another instance,
where model parameters reflect characteristics of items and of persons responding
to the items of a test (Fischer, 1996).

The basic thesis of this paper is that the second approach involving construct 
validity should be advanced within the domain of judgments and decisions. Due to 
their construction, models satisfying construct validity have the potential to inte- 
grate substantive theories on the nature of judgments and methods of data analysis. 
In order to substantiate the superiority of the second approach, the following expo- 
sition introduces selected theories and findings on social judgments from two broad 
research lines first. Subsequently, it is outlined how these relate to model parame- 
ters within the two model classes introduced above by use of empirical data. Finally, 
implications of the findings on principles of model construction are discussed. 



3 Theories of Judgment and Empirical Findings 

The first research line concerns the way people integrate pieces of information for 
judgment purposes. Judgments consist of gradations along a number of dimensions 
such as valence or agreeableness of persons (e.g., judgments about the degree a per- 
son is friendly or unfriendly, idealistic or materialistic, talented or dull) (Anderson 
& Sedikides, 1991). In a number of judgment conditions people make judgments 
based on all of the relevant information, weighted and combined into a dimension 
by an algebraic integration principle (Anderson, 1981; Fiske & Taylor, 2008). This 
principle was first stated by Benjamin Franklin (cited from Dawes
& Corrigan, 1974, p. 95):

My way is to divide half a sheet of paper by a line into two columns; writing over the one 
Pro, and over the other Con. Then, ... I put down under the different heads short hints ... for 
or against the measure. When I have thus got them all together in one view, I endeavor to 
estimate the respective weights ... to find at length where the balance lies. 

The principle may be stated formally as $y_i = \sum_j b_j x_{ij}$ and is nowadays
known as Franklin’s rule (cf. Gigerenzer & Todd, 1999). Within this basic rationale,
empirical results have shown that people use weights $b_j$ of $+1$ and $-1$ to form
the judgment, termed the unit weighting principle (Bröder, 2002; Dana & Dawes,
2004). Thus, people simply add information with positive evidence for the judgment
(i.e., Pros) and subtract information with negative evidence (i.e., Cons).

The second research line is concerned with the effects that existing knowledge
structures in memory have on judgments. In the social domain, such knowledge
structures comprise categories or stereotypes. Stereotypes are cognitive structures
that contain people’s knowledge, beliefs and expectations about social groups (e.g.,
Fiske & Taylor, 2008). They involve illusory correlations of category membership
and specific attribute domains (Hamilton & Gifford, 1976). Thus, they create connections
between judgment dimensions which are statistically independent. As an
example, in the stereotype of a “skinhead” agreeableness and dominance are negatively
correlated. Thus, a person categorized as a skinhead is judged as not agreeable
and dominant. Hence, stereotypes bias judgments by causing such correlations.

The following exposition outlines both research domains in detail before return- 
ing to the question of model construction. 



3.1 The Problem of Information Weighting 

As to the question how information is weighted and combined into a judgment, 
behavioral decision research is confronted with the problem of drawing conclu- 
sions about unobservable decision strategies from behavioral data. Strategies like 
the unit weighting principle or the standard multiple regression model are com- 
peting theories about information integration in judgment and decision tasks. The 
design of studies which have the aim to draw corresponding conclusions consists of 
regressing judgment data as criterion values on the presented pieces of information 
as a set of predictors by either strategy and subsequently comparing the variances 
accounted for. 

Empirical evidence which shows that the unit weighting principle is superior to 
the regression model with optimal regression weights in approximating human judg- 
ments comes from experimental and field studies within functional measurement 
theory (Anderson, 1981), social judgment theory (Stewart, 1988), simulation studies 
(Bröder, 2002) and some other domains. In these studies, unit weighting models are
superior in the sense that they are more parsimonious than regression with optimal 
weights, but have comparable empirical validity. Consequently, in order to formulate 
an adequate, that is, a frugal model for the analysis of judgment data, the weights 
may be restricted to unity without much loss of information. 

Even more intriguing is the fact that unit weighting models correlate highly and 
in a number of studies nearly perfectly with the predictions from standard regression 
analysis. Stated in other words, it has been repeatedly demonstrated that the unit 
weighting strategy is fairly accurate as compared to regression models with optimal 
regression weights (Bröder, 2002; Dawes & Corrigan, 1974; Wainer, 1976).
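Why composites formed with unit weights track differentially weighted composites so closely can be seen in a toy example. The following sketch assumes three binary cues and a set of hypothetical differential weights (one of them negative, i.e., a Con); none of these values come from the cited studies. It forms both composites over all possible cue profiles and correlates them.

```python
from itertools import product


def correlation(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def composite(cues, weights):
    """Weighted additive judgment y = sum_j b_j x_j."""
    return sum(w * c for w, c in zip(weights, cues))


# Hypothetical differential weights (Franklin's rule) for three cues.
differential_weights = [0.9, 0.6, -0.3]

# Unit weights (Dawes' rule) keep only the sign of each weight.
unit_weights = [1 if w > 0 else -1 for w in differential_weights]

# All eight profiles of three binary cues (cue absent = 0, cue present = 1).
profiles = list(product([0, 1], repeat=3))

franklin = [composite(c, differential_weights) for c in profiles]
dawes = [composite(c, unit_weights) for c in profiles]

r = correlation(franklin, dawes)
print("correlation between the two composites:", r)
```

In this constructed case the correlation is roughly 0.93 although the differential weights vary by a factor of three, which illustrates why the two strategies are hard to distinguish by their predictions alone.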

Finally, and most importantly, unit weighting yields a valid prediction of a
known, true criterion. Far-reaching findings were presented by Dawes and Corrigan
(1974). In a number of studies, they used large empirical data sets from clinical
psychology, education and perception. In the following, their procedures and results
will be very briefly outlined.

In Study 1, first-year graduate students in the department of psychology at the
University of Illinois were evaluated on 10 variables which were predictive of academic
success. These variables included aptitude test scores, college grade point
average, peer-ratings on extroversion and self-ratings on conscientiousness. A graduate
grade point average (GPA) was computed for all these students. This served as the
external validity criterion. The aim of the study was to predict the GPA from
the 10 variables.

In Study 2, graduate students in the department of psychology at the University
of Oregon, who had been there for at least two years, were evaluated on a five-point
rating scale by faculty members who knew them well. This was one external validity
criterion. The other criterion was the final graduate grade point average (GPA).

At the time the students applied, three scores were obtained for each student:
his or her Graduate Record Examination (GRE) score, undergraduate grade point average
and a score for the quality of the institution at which the undergraduate degree had
been obtained. These scores were available to the admission committee at the time
the students applied and they served as predictors. The problem was twofold: (1) to
predict the final graduate GPA from these three variables and (2) to
predict the ratings of the faculty members from these variables.

In Study 3, which was an experiment on perception, participants received ellipses
which were varied on the basis of each figure’s size $i$, eccentricity $j$, and grayness $k$.
The formula for variation used by the experimenters was $ij + jk + ik$. Participants’
task was to estimate the value of each ellipse. The external validity criterion was
the true (that is, experimenter assigned) value of each ellipse on the basis of its size,
eccentricity, and grayness.

In all of these studies, the data analysis was the following. The problem was 
always to predict the external validity criterion. Predictors were integrated for the 
prediction by several models. One was by use of estimating optimal beta weights 
in a standard regression analysis and another one was by use of a unit weighting 
to integrate the predictors. Table 1 shows the validity results from applying both 
models. 

Results from the experimental Study 3 show identical validity coefficients for 
both models (see Table 1). In the other studies, validity of the optimal linear model 
is increased as compared to the unit weighting scheme, but only slightly increased. 
Thus, the difference in validity coefficients between the two models does not really 
matter. The conclusion drawn from this and many other studies is that researchers in
the domain of judgments need not be concerned with an optimal model at all (Bröder,
2002; Dawes & Corrigan, 1974; Wainer, 1976). Obviously, unit weights are about as
predictive as regression procedures (e.g., Einhorn & Hogarth, 1975; Schmidt, 1972;
Claudy, 1972).

Table 1 Correlations between predictors and external criterion values from the Dawes and
Corrigan study (Dawes & Corrigan, 1974)

Study                                       Validity of unit     Validity of optimal
                                            weighting model      linear model
1: Illinois students’ predictions of GPA    0.60                 0.69
2: Oregon students’ predictions of GPA      0.60                 0.69
2': Oregon faculty members’ ratings         0.48                 0.54
3: Predictions of ellipses                  0.97                 0.97

The obvious next question is then: What are the reasons for the unit weighting
model to be an adequate approximation for human decision behavior and to
be externally valid? At least three reasons have been discussed (see Einhorn &
Hogarth, 1975 for a review): (1) The weighting problem is subsidiary to specifying
the relevant variables which should be put into the model. That is, once the
relevant variables are included in the model, their weighting may not be very important
(cf. Dawes & Corrigan, 1974). (2) In estimating optimal regression weights,
there will inevitably be sampling error. In contrast, unit weights have no sampling
error. Hence, there may be a potential trade-off between accuracy of estimation and
estimation without error (or nearly without error). As Einhorn and Hogarth (1975,
p. 173) put it, “because judgment data will contain both sampling and measurement
error, the relative superiority of regression procedures over unit weighting may be
quite small (or nonexistent)”.

Thus, unit weighting models are adequate in the sense that they reflect compo- 
nents of human judgment processes in a more parsimonious way and in the sense 
that the prediction is as externally valid as that of standard regression. As opposed
to Franklin’s rule, which represents the standard regression model, unit weighting is
nowadays well known as Dawes’ rule (Gigerenzer & Todd, 1999).



3.2 Illusory Correlations in Judgments 

Dawes’ rule turns out to be valid in conditions where people make judgments based
on all of the relevant information, which is then integrated attribute by attribute.
This holds true for all conditions where judgments have significant consequences
and thus make people accountable for their decisions, or when people have enough
processing capacity at their disposal to revisit all the given information (Fiske &
Taylor, 2008).

Under other conditions, however, people cannot afford the large processing capacity
which is required for fully integrating the pieces of information, or they are not
motivated to do so. In these cases, people use heuristic strategies, which are much
simpler, but still represent viable alternatives (Gigerenzer & Todd, 1999). There
are a number of heuristics which cannot be discussed in the present context due
to space limitations. However, one prevalent strategy is to base one’s judgments
on stereotypes which involve illusory correlations of attribute domains which are
in fact statistically independent (Hamilton & Gifford, 1976).

The cognitive basis is that people overestimate the frequency of co-occurrence 
of events which are statistically infrequent: If one group of persons occurs less fre- 
quently than another and one type of behaviors occurs infrequently, then observers 
overestimate the frequency that this type of behavior was performed by members of 
that group. 

In an experiment, Hamilton and Gifford (1976) presented statements about members
of two groups, which were simply labeled Group A and Group B. The stimulus
set contained twice as many statements about Group A as about Group B. Furthermore,
the behavioral statements were desirable or undesirable, with twice as
many desirable as undesirable behaviors in each group. Because the proportions of
desirable and undesirable statements were identical for the two groups, there is no
correlation between group membership and desirability in the stimulus set. Participants
had to estimate the number of undesirable behaviors in each group. Table 2
shows the frequency of the statements and the estimates of the participants.

Table 2 Experiment of Hamilton and Gifford (1976)

                 Frequency of stimulus       Mean frequency
                 sentences                   estimates
Behaviors        Group A      Group B        Group A      Group B
Desirable        18           9              17.1         7.2
Undesirable      8            4              8.9          5.8
Σ                26           13             26.0         13.0

The results show a disproportionately high estimate of undesirable behaviors
for Group B as compared to Group A. Phi coefficients between group membership
and desirability were significant, which indicates an erroneous perception of an
association between the smaller Group B and undesirable behaviors (Hamilton &
Gifford, 1976).
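The size of the illusory association can be made concrete with the frequencies reported in Table 2. The sketch below computes the phi coefficient of a 2 x 2 table, once for the stimulus frequencies and once for the mean frequency estimates. Note the assumption: applying phi to the mean estimates is only an illustration and is not the analysis reported by Hamilton and Gifford (1976); the function name is hypothetical.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient of the 2 x 2 table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Rows: desirable, undesirable behaviors; columns: Group A, Group B (Table 2).
phi_stimuli = phi_coefficient(18, 9, 8, 4)
phi_estimates = phi_coefficient(17.1, 7.2, 8.9, 5.8)

print("phi in the stimulus set:        ", phi_stimuli)
print("phi in the mean estimates:      ", phi_estimates)
```

The stimulus frequencies give a coefficient of zero, because the proportions are identical in both groups, whereas the mean estimates give a small positive coefficient of about 0.10, linking Group B with undesirable behaviors.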

Having revisited two important research domains on social judgments, the following
exposition focuses on the question of whether there are models of data analysis which
may incorporate this way of theorizing.

If construct validity is the criterion to be optimized in the construction of a model 
of data analysis, then an adequate formal approach must incorporate the follow- 
ing components: (1) A formal model must include the unit weighting principle 
to integrate person information. (2) The model has to specify parameters for the 
correlations of judgment dimensions in different conditions. 

An appropriate model class might be three-way two-mode multidimensional 
scaling. These models have the potential to reflect the occurrence of correlated or 
independent judgment dimensions due to stereotype use conditions. 



4 Three-Way Two-Mode Models 



Three-way two-mode scaling models may be subdivided into the two model classes
that were introduced at the outset. Thus, there is one class optimizing an internal
(mathematical) criterion function. An example is the Tucker model (Tucker, 1972).
In contrast, a model which was derived to satisfy construct validity is the SUMM-ID
approach (Krolak-Schwerdt, 2005). What the models have in common are the input
data and the basic model equation for the data. In the following, the scalar product
form of the models will be outlined.

The input data consist of a three-way data matrix $X = (x_{ijj'})$, $i = 1, \ldots, I$,
$j, j' = 1, \ldots, J$, where $I$ is the number of individuals or conditions and $J$ the
number of attributes. $X$ can be thought of as comprising a set of $I$ ($\geq 2$) $J \times J$ scalar
products matrices. $X_i$, a slice of the three-way matrix, consists of scalar products
between attributes $j, j'$ for an individual or a condition $i$.

The basic model equation can be expressed as $X_i = B H_i B' + E_i$, where $B$
is a $J \times P$ matrix specifying an attribute space or judgment configuration which
is common to all individuals or conditions, and where $P$ is the number of dimensions.
$H_i$ is a $P \times P$ symmetric matrix designating the nature of individual $i$'s representation
of the judgment dimensions. Diagonal elements $h_{ipp}$ of $H_i$ correspond to
weights applied to the judgment dimensions by individual $i$, while off-diagonal elements
$h_{ipp'}$ are related to perceived relationships among the judgment dimensions
$p$ and $p'$. Matrix $H_i$, termed core matrix (Tucker, 1972), transforms the common
judgment space into the individual representation, and $E_i$ collects the errors of
approximation $e_{ijj'}$.

Thus, the basic model equation assumes that there is a common space repre- 
sented by matrix B which underlies judgments in general. On the basis of the 
common space, the model allows for two kinds of distortions in individual rep- 
resentations. The first is that individuals may attach different weights to different 
judgment dimensions. More important in the present context is the second type of 
distortion: Individual representations may be rotated versions of the common space 
in which independent dimensions become correlated. 
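The role of the off-diagonal core elements in the model equation $X_i = B H_i B'$ can be sketched with a small constructed example. The code below assumes an error-free model, a common space $B$ with $J = 4$ attributes and $P = 2$ orthonormal dimensions, and two hypothetical core matrices; all numerical values and function names are illustrative only. It shows that attributes loading on different dimensions have zero scalar product under a diagonal core matrix, but a non-zero scalar product once the core matrix has an off-diagonal entry.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def scalar_products(B, H):
    """Model part of the basic equation: X_i = B H_i B' (errors E_i omitted)."""
    return matmul(matmul(B, H), transpose(B))


r = 1.0 / math.sqrt(2.0)
# Hypothetical common space: attributes 1-2 load on dimension 1, attributes 3-4 on dimension 2.
B = [[r, 0.0], [r, 0.0], [0.0, r], [0.0, r]]

# Hypothetical core matrices for two conditions.
H_independent = [[0.7, 0.0], [0.0, 0.6]]   # dimensions used independently
H_correlated = [[0.7, 0.5], [0.5, 0.6]]    # dimensions perceived as related

X_independent = scalar_products(B, H_independent)
X_correlated = scalar_products(B, H_correlated)

# Because B'B = I in this example, the core matrix is recovered as B' X_i B.
H_recovered = matmul(matmul(transpose(B), X_correlated), B)

print("x_13, independent dimensions:", X_independent[0][2])
print("x_13, correlated dimensions: ", X_correlated[0][2])
print("recovered core matrix:       ", H_recovered)
```

The recovery step relies on the orthonormal columns chosen here; it is a property of this example, not a general estimation procedure for either model class.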

To return to the two model classes, there are a number of differences between
them, but in the present context the most important may be sketched in the following
way: Models optimizing an internal criterion function such as the Tucker model
determine the parameter matrices such that the discrepancy between obtained and
predicted data is minimized,
$\sum_i \sum_j \sum_{j'} (x_{ijj'} - \sum_p \sum_{p'} b_{jp} b_{j'p'} h_{ipp'})^2 := \min$. This
is accomplished by a principal component analysis in the attributes’ mode or by
an alternating least squares approach. The important fact for the present research
question is that any real-valued estimates for the entries of $B$ are admitted as
long as the discrepancy function is minimal. In terms of the judgment process this
implies Franklin’s rule.

In contrast, SUMM-ID integrates the unit weighting principle. To sketch the
underlying rationale very briefly, sign vectors $z_p$ for the attributes $j$, $z_{jp} \in \{-1, 1\}$,
and, in an analogous way, sign vectors $s_p$ for the individuals or conditions
$i$, $s_{ip} \in \{-1, 1\}$, are introduced, where
$\sum_i \sum_j \sum_{j'} s_{ip} z_{jp} z_{j'p} x_{ijj'} = t_p := \max$.
An estimate of $B = (b_{jp})$ is obtained from the sums $\sum_i \sum_{j'} s_{ip} z_{j'p} x_{ijj'}$,
normalized by a power of $t_p$.

For a more thorough discussion of the model, the reader is referred to
Krolak-Schwerdt (2005).

Results from applying both approaches, the Tucker model and SUMM-ID, may
be fundamentally different as to construct validity of the model parameters. This
will be demonstrated in the following by applying both approaches to experimental
data.

In an experiment on the topic of teachers’ achievement judgments of students’ performance,
experienced school teachers as participants received case reports about
students as materials. Each case report contained information on social activity, discipline,
capability and motivation of the student. Teachers were told to form an
impression of the student as in usual classroom assessments. The experiment had a
factorial design where the first factor was activation of a stereotype. That is, in one
of two experimental conditions a stereotype (e.g., “the student is a bloomer”) was
activated prior to the presentation of the case report, while the other condition proceeded
without stereotype activation (termed “non-stereotype” in the following).
The second factor was replication with two other descriptions presented with or
without stereotype activation.

After having read one of the descriptions, subjects had to rate the case report 
on seven rating scales such as capability of achievement, work ethics and so on. 
These scales correspond to the following dimensions: (1) social competence, (2) 
reasoning, (3) language capability. 

For each case report, the normalized distance matrix between the scales was used 
as input for the data analysis. These data were then subjected to SUMM-ID and the 
Tucker approach. We expected the following results: A common judgment space 
should occur which consists of the three a-priori dimensions just mentioned. Illusory 
correlations should appear in increased off-diagonal values of the core matrix in the 
two stereotype conditions. In the other conditions, these values should be near zero. 



4.1.1 Results 

As to explained variation in the data, both models showed an excellent recovery 
of the data. The Tucker approach with 94% is slightly superior to SUMM-ID with 
92%. In the SUMM-ID solution, we found three dimensions in the common judg- 
ment space. After Varimax rotation, these reflected the expected dimensions (that is, 
language capability, social competence and reasoning). 

The core matrix* is shown in Table 3. The off-diagonal core values exhibit the
expected pattern: Judging in the presence of a stereotype yields illusory correlations
in both description sets. Thus, by use of the stereotype “bloomer” teachers attribute
high language and reasoning capabilities coupled with high social competence. Also
in accordance with our hypothesis, we find rather independent dimensions in the
non-stereotype conditions. That is, different judgment domains are used in a more
unconfounded manner.

* Diagonal values of the core matrix will not be discussed in the following, as they do not contribute
to the estimation of the models' construct validity in the present context.

Table 3 Core matrix of the SUMM-ID solution for the school achievement data

Experimental       Judgment               Language      Social        Reasoning
conditions         dimensions             capability    competence
Set 1:
Non-stereotype     Language capability    0.72
                   Social competence      0.12          0.68
                   Reasoning              0.05          0.03          0.66
Stereotype         Language capability    0.61
                   Social competence      0.74          0.52
                   Reasoning              0.34          0.55          0.56
Set 2:
Non-stereotype     Language capability    0.79
                   Social competence      0.14          0.77
                   Reasoning              0.01          0.07          0.95
Stereotype         Language capability    0.77
                   Social competence      0.57          0.47
                   Reasoning              0.75          0.62          0.52

Table 4 Core matrix of the Tucker model for the school achievement data

Experimental       Judgment               Language      Social        Reasoning
conditions         dimensions             capability    competence
Set 1:
Non-stereotype     Language capability    1.73
                   Social competence      0.41          1.19
                   Reasoning              0.95          0.03          1.26
Stereotype         Language capability    1.46
                   Social competence      0.74          1.00
                   Reasoning              0.27          0.48          1.75
Set 2:
Non-stereotype     Language capability    1.90
                   Social competence      0.33          1.20
                   Reasoning              0.08          0.16          1.01
Stereotype         Language capability    1.96
                   Social competence      0.07          1.06
                   Reasoning              0.07          0.10          1.09

From the Tucker approach, the expected judgment dimensions were also
obtained. However, the off-diagonal values of the Tucker core matrix, which are
shown in Table 4, do not show a systematic pattern of high vs. low correlations due
to stereotype use. Rather, some high values are found in the non-stereotype condition
and some low ones in the stereotype condition. Thus, the values do not indicate
an increase in the magnitude of correlations between dimensions due to stereotype
activation.
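The contrast between the two core matrices can be summarized by averaging the off-diagonal values per condition. The sketch below uses the values reported in Tables 3 and 4; the averaging itself is only an illustrative summary and was not part of the reported analysis, and the function and variable names are hypothetical.

```python
def mean_off_diagonal(lower):
    """Average the entries below the diagonal of a lower-triangular matrix."""
    values = [lower[r][c] for r in range(len(lower)) for c in range(r)]
    return sum(values) / len(values)


# Lower-triangular core matrices; rows: language capability, social competence, reasoning.
summ_id = {
    ("Set 1", "non-stereotype"): [[0.72], [0.12, 0.68], [0.05, 0.03, 0.66]],
    ("Set 1", "stereotype"):     [[0.61], [0.74, 0.52], [0.34, 0.55, 0.56]],
    ("Set 2", "non-stereotype"): [[0.79], [0.14, 0.77], [0.01, 0.07, 0.95]],
    ("Set 2", "stereotype"):     [[0.77], [0.57, 0.47], [0.75, 0.62, 0.52]],
}
tucker = {
    ("Set 1", "non-stereotype"): [[1.73], [0.41, 1.19], [0.95, 0.03, 1.26]],
    ("Set 1", "stereotype"):     [[1.46], [0.74, 1.00], [0.27, 0.48, 1.75]],
    ("Set 2", "non-stereotype"): [[1.90], [0.33, 1.20], [0.08, 0.16, 1.01]],
    ("Set 2", "stereotype"):     [[1.96], [0.07, 1.06], [0.07, 0.10, 1.09]],
}

summ_id_means = {key: mean_off_diagonal(m) for key, m in summ_id.items()}
tucker_means = {key: mean_off_diagonal(m) for key, m in tucker.items()}

for key in summ_id:
    print(key, "SUMM-ID: %.2f" % summ_id_means[key], "Tucker: %.2f" % tucker_means[key])
```

For SUMM-ID the mean off-diagonal value is clearly higher under stereotype activation in both description sets, whereas for the Tucker model the two conditions barely differ in Set 1 and the ordering is reversed in Set 2.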






In conclusion, there is no correspondence between the parameters obtained from 
the Tucker model and the experimental manipulations. In contrast, the SUMM- 
ID model reflects the expected structure of the common judgment space and the 
expected distortions of this space due to stereotype activation in every detail. That 
is, the parameters of the approach were sensitive to manipulations of stereotype 
activation and thus have construct validity. 



5 Conclusions 

At the outset, we distinguished two ways of model construction: optimizing a
mathematical criterion function, which is the usual approach, or integrating an underlying
theory such that the model parameters have construct validity. As to empirical
validity, results from both approaches were comparable.

As to construct validity, the second approach turns out to be superior. The 
reason is that optimizing a mathematical criterion function does not induce a 
non-theoretical rationale. Rather, this approach yields another formal theory about 
judgments which does not correspond to substantive theories. People simply do 
not consider all possible weights for person information, but only a very limited 
number. Thus, the unit weight rule is a better predictor of people’s judgments than 
Franklin’s rule. In more general terms, optimizing construct validity guarantees a 
close correspondence of the formal model to the underlying substantive theory. As 
a consequence, the model extracts the theoretically significant parts from the data.
As the presented empirical results have also shown, this does not mean giving
up optimal predictions in the sense of minimizing discrepancies between predicted and
obtained data. In conclusion, then, the second strategy of model construction should
receive more attention in future research if the aim is to develop valid models of the
corresponding research domain.
Clustering of High-Dimensional Data via Finite
Mixture Models



Geoff J. McLachlan and Jangsun Baek



Abstract Finite mixture models are being commonly used in a wide range of 
applications in practice concerning density estimation and clustering. An attractive 
feature of this approach to clustering is that it provides a sound statistical frame- 
work in which to assess the important question of how many clusters there are in 
the data and their validity. We review the application of normal mixture models 
to high-dimensional data of a continuous nature. One way to handle the fitting of 
normal mixture models is to adopt mixtures of factor analyzers. They enable model- 
based density estimation and clustering to be undertaken for high-dimensional data, 
where the number of observations n is not very large relative to their dimension 
p. In practice, there is often the need to reduce further the number of parameters 
in the specification of the component-covariance matrices. We focus here on a new 
modified approach that uses common component-factor loadings, which consider- 
ably reduces further the number of parameters. Moreover, it allows the data to be 
displayed in low-dimensional plots. 



Keywords Common factor analyzers • Mixtures of factor analyzers • Model-based
clustering • Normal mixture densities.



1 Introduction 



Clustering procedures based on finite mixture models are being increasingly pre- 
ferred over heuristic methods due to their sound mathematical basis and to the inter- 
pretability of their results. Mixture model-based procedures provide a probabilistic 
clustering that allows for overlapping clusters corresponding to the components of
the mixture model. The uncertainties that the observations belong to the clusters are
provided in terms of the fitted values for their posterior probabilities of component
membership of the mixture. As each component in a finite mixture model corresponds
to a cluster, it allows the important question of how many clusters there are
in the data to be approached through an assessment of how many components are 
needed in the mixture model. These questions of model choice can be considered in 
terms of the likelihood function (see, for example, McLachlan, 1982; McLachlan & 
Peel, 2000). 



2 Definition of Mixture Models 



We let $Y$ denote a random vector consisting of $p$ feature variables associated with
the random phenomenon of interest. We let $y_1, \ldots, y_n$ denote an observed random
sample of size $n$ on $Y$. With the finite mixture model-based approach to density
estimation and clustering, the density of $Y$ is modelled as a mixture of a number
($g$) of component densities $f_i(y; \theta_i)$ in some unknown proportions $\pi_1, \ldots, \pi_g$,
where $f_i(y; \theta_i)$ is specified up to an unknown parameter vector $\theta_i$ $(i = 1, \ldots, g)$.
That is, each data point is taken to be a realization of the mixture probability density
function (p.d.f.),

$$f(y; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y; \theta_i), \tag{1}$$

where the mixing proportions $\pi_i$ are nonnegative and sum to one. In density estimation,
the number of components $g$ can be taken sufficiently large for (1) to provide
an arbitrarily accurate estimate of the underlying density function.

The vector of all unknown parameters is given by $\Psi = (\omega^T, \pi_1, \ldots, \pi_{g-1})^T$,
where $\omega$ consists of the elements of the $\theta_i$ known a priori to be distinct. For an
observed random sample, $y_1, \ldots, y_n$, the log likelihood function for $\Psi$ is given by

$$\log L(\Psi) = \sum_{j=1}^{n} \log f(y_j; \Psi). \tag{2}$$

The maximum likelihood (ML) estimate of $\Psi$ is given by an appropriate root of
the likelihood equation,

$$\partial \log L(\Psi) / \partial \Psi = 0. \tag{3}$$

Solutions of (3) corresponding to local maximizers of $\log L(\Psi)$ can be obtained via
the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977).
In the event that the EM sequence is trapped at some stationary point that is not a
local or global maximizer of $\log L(\Psi)$ (for example, a saddle point), a small random
perturbation of $\Psi$ away from the saddle point will cause the EM algorithm to diverge
from the saddle point (McLachlan & Krishnan, 2008).
For clustering purposes, each component in the mixture model (1) corresponds
to a cluster. The posterior probability that an observation with feature vector $y_j$
belongs to the $i$th component of the mixture can be expressed by Bayes' theorem as

$$\tau_i(y_j; \Psi) = \frac{\pi_i f_i(y_j; \theta_i)}{\sum_{h=1}^{g} \pi_h f_h(y_j; \theta_h)} \quad (i = 1, \ldots, g;\ j = 1, \ldots, n). \tag{4}$$

A probabilistic clustering of the data into $g$ clusters can be obtained in terms of the
fitted posterior probabilities of component membership for the data.

An outright partitioning of the observations into $g$ nonoverlapping clusters
$C_1, \ldots, C_g$ is effected by assigning each observation to the component to which
it has the highest estimated posterior probability of belonging. Thus the $i$th cluster
$C_i$ contains those observations assigned to group $G_i$. That is, $C_i$ contains those
observations $y_j$ with $\hat{z}_{ij} = (\hat{z}_j)_i = 1$, where

$$\hat{z}_{ij} = \begin{cases} 1, & \text{if } \tau_i(y_j; \hat{\Psi}) \ge \tau_h(y_j; \hat{\Psi}) \quad (h = 1, \ldots, g;\ h \ne i), \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

As the notation implies, $\hat{z}_{ij}$ can be viewed as an estimate of $z_{ij}$ which, under the
assumption that the observations come from a mixture of $g$ groups $G_1, \ldots, G_g$, is
defined to be one or zero according as the $j$th observation does or does not come
from $G_i$ $(i = 1, \ldots, g;\ j = 1, \ldots, n)$.
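As a minimal sketch of (1), (2), (4) and (5), the following Python code evaluates the mixture density, the log likelihood, the posterior probabilities of component membership and the outright assignment. It assumes univariate normal component densities and uses hypothetical parameter values and function names; none of these choices comes from the text, which leaves the component densities general at this point.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density (assumed form of the component density)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(y, props, means, variances):
    """Mixture density (1): sum over components of pi_i * f_i(y; theta_i)."""
    return sum(p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances))

def log_likelihood(data, props, means, variances):
    """Log likelihood (2): sum over observations of log f(y_j; Psi)."""
    return sum(math.log(mixture_density(y, props, means, variances)) for y in data)

def posterior_probs(y, props, means, variances):
    """Posterior probabilities (4) of component membership for one observation."""
    weighted = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]

def hard_assign(y, props, means, variances):
    """Rule (5): index of the component with the highest posterior probability."""
    tau = posterior_probs(y, props, means, variances)
    return max(range(len(tau)), key=lambda i: tau[i])

# Hypothetical two-component mixture and data (illustrative values only).
props, means, variances = [0.4, 0.6], [0.0, 5.0], [1.0, 1.0]
data = [-0.3, 0.8, 4.6, 5.9, 2.0]
labels = [hard_assign(y, props, means, variances) for y in data]
loglik = log_likelihood(data, props, means, variances)
```

The posterior probabilities give the probabilistic clustering, while `labels` gives the outright partition; ties in (5) are broken here in favour of the lowest component index, which is a convenience of the sketch.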



3 Choice of Starting Values for the EM Algorithm 



McLachlan and Peel (2000) provide an in-depth account of the fitting of finite mixture
models. Briefly, with mixture models the likelihood typically will have multiple
maxima; that is, the likelihood equation will have multiple roots. Thus the EM algorithm
needs to be started from a variety of initial values for the parameter vector $\Psi$ or
for a variety of initial partitions of the data into $g$ groups. The latter can be obtained
by randomly dividing the data into $g$ groups corresponding to the $g$ components of
the mixture model. With random starts, the effect of the central limit theorem tends
to make the component parameters initially similar, at least in large samples.
Nonrandom partitions of the data can be obtained via some clustering procedure
such as $k$-means. Also, Coleman, Dong, Hardin, Rocke, and Woodruff (1999) have
proposed some procedures for obtaining nonrandom starting partitions.

The choice of root of the likelihood equation in the case of homoscedastic normal 
components is straightforward in the sense that the ML estimate exists as the global 
maximizer of the likelihood function. The situation is less straightforward in the 
case of heteroscedastic normal components as the likelihood function is unbounded. 
Usually, the intent is to choose as the ML estimate of the parameter vector $\Psi$ the
local maximizer corresponding to the largest of the local maxima located. But in 
practice, consideration has to be given to the problem of relatively large local max- 
ima that occur as a consequence of a fitted component having a very small (but 
nonzero) variance for univariate data or generalized variance (the determinant of 
the covariance matrix) for multivariate data. Such a component corresponds to a 
cluster containing a few data points either relatively close together or almost lying 
in a lower-dimensional subspace in the case of multivariate data. There is thus a need 
to monitor the relative size of the fitted mixing proportions and of the component 
variances for univariate observations, or of the generalized component variances for 
multivariate data, in an attempt to identify these spurious local maximizers. 
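The multiple-start strategy and the monitoring of spurious local maximizers can be sketched as follows for a univariate normal mixture with unequal variances. This is an illustration under stated assumptions, not the procedure of any particular program: the starting values (means drawn at random from the data, common starting variance), the thresholds `min_var` and `min_prop`, the simulated data and all function names are hypothetical.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, props, means, variances):
    """Posterior probabilities of membership and the log likelihood."""
    resp, loglik = [], 0.0
    for y in data:
        weighted = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
        loglik += math.log(total)
    return resp, loglik

def m_step(data, resp):
    """Update the proportions, means and variances given the memberships."""
    n, g = len(data), len(resp[0])
    props, means, variances = [], [], []
    for i in range(g):
        weight = sum(r[i] for r in resp)
        mean = sum(r[i] * y for r, y in zip(resp, data)) / weight
        var = sum(r[i] * (y - mean) ** 2 for r, y in zip(resp, data)) / weight
        props.append(weight / n)
        means.append(mean)
        variances.append(var)
    return props, means, variances

def run_em(data, start, n_iter=200, tol=1e-8):
    """Iterate EM from one starting value until the log likelihood stalls."""
    params, previous = start, -math.inf
    for _ in range(n_iter):
        resp, loglik = e_step(data, *params)
        if loglik - previous < tol:
            break
        previous = loglik
        params = m_step(data, resp)
    return params, e_step(data, *params)[1]

def fit_multiple_starts(data, g, n_starts=10, seed=0, min_var=1e-4, min_prop=0.01):
    """Keep the largest local maximum located that does not look spurious."""
    rng = random.Random(seed)
    overall_mean = sum(data) / len(data)
    overall_var = sum((y - overall_mean) ** 2 for y in data) / len(data)
    best = None
    for _ in range(n_starts):
        start = ([1.0 / g] * g, rng.sample(data, g), [overall_var] * g)
        (props, means, variances), loglik = run_em(data, start)
        # Monitor mixing proportions and variances to flag spurious solutions.
        spurious = min(variances) < min_var or min(props) < min_prop
        if not spurious and (best is None or loglik > best[1]):
            best = ((props, means, variances), loglik)
    return best

# Simulated data from two well-separated normal groups (illustrative only).
gen = random.Random(1)
data = [gen.gauss(0.0, 1.0) for _ in range(150)] + [gen.gauss(6.0, 1.0) for _ in range(150)]
best = fit_multiple_starts(data, g=2)
```

The function returns `None` if every start is flagged as spurious. In practice the thresholds would need to be chosen relative to the scale of the data, and a variance that collapses to exactly zero would have to be handled explicitly.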



4 Clustering via Normal Mixtures 



Frequently, in practice, the clusters in the data are essentially elliptical, so that it is 
reasonable to consider fitting mixtures of elliptically symmetric component densi- 
ties. Within this class of component densities, the multivariate normal density is a 
convenient choice given its computational tractability. 

Under the assumption of multivariate normal components, the $i$th component-conditional
density $f_i(y; \theta_i)$ is given by

$$f_i(y; \theta_i) = \phi(y; \mu_i, \Sigma_i), \tag{6}$$

where $\theta_i$ consists of the elements of $\mu_i$ and the $\frac{1}{2}p(p+1)$ distinct elements of
$\Sigma_i$ $(i = 1, \ldots, g)$. Here

$$\phi(y; \mu_i, \Sigma_i) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\{-\tfrac{1}{2}(y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i)\}. \tag{7}$$

One attractive feature of adopting mixture models with elliptically symmetric
components, such as the normal or $t$-densities, is that the implied clustering is invariant
under affine transformations of the data; that is, invariant under transformations
of the feature vector $y$ of the form,

$$y \rightarrow Cy + a, \tag{8}$$

where C is a nonsingular matrix. If the clustering of a procedure is invariant under 
(8) for only diagonal C, then it is invariant under change of measuring units but 
not rotations. But as commented upon by Hartigan (1975), this form of invariance 
is more compelling than affine invariance. 

It can be seen from (7) that the mixture model with unrestricted component-covariance
matrices in its normal component distributions is a highly parameterized
one with $\frac{1}{2}p(p+1)$ parameters for each component-covariance matrix $\Sigma_i$
$(i = 1, \ldots, g)$. As an alternative to taking the component-covariance matrices to be the
same or diagonal, we can adopt some model for the component-covariance matrices
that is intermediate between homoscedasticity and the unrestricted model, as in the
approach of Banfield and Raftery (1993). They introduced a parameterization of
the component-covariance matrix $\Sigma_i$ based on a variant of the standard spectral
decomposition of $\Sigma_i$.

The mixture model with normal components (7) is sensitive to outliers since it 
adopts the multivariate normal family for the distributions of the errors. An obvious 
way to improve the robustness of this model for data which have longer tails than 
the normal or atypical observations is to consider using the multivariate $t$-family of
elliptically symmetric distributions (McLachlan & Peel, 1998; McLachlan & Peel, 
2000, Chap. 7). It has an additional parameter called the degrees of freedom that 
controls the length of the tails of the distribution. Although the number of outliers 
needed for breakdown is almost the same as with the normal distribution, the outliers 
have to be much larger (see Hennig, 2003, 2004). 
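The affine invariance under (8) can be checked numerically. The sketch below, which assumes NumPy and uses hypothetical parameter values, transforms an observation by $y \rightarrow Cy + a$ and the component parameters correspondingly ($\mu_i \rightarrow C\mu_i + a$, $\Sigma_i \rightarrow C\Sigma_i C^T$), and compares the posterior probabilities of component membership before and after. It only illustrates why the assignments do not change once the parameters transform in this way; it does not fit a model.

```python
import numpy as np

def normal_density(y, mu, sigma):
    """Multivariate normal density phi(y; mu, Sigma), as in (7)."""
    p = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** p * np.linalg.det(sigma))

def posteriors(y, props, mus, sigmas):
    """Posterior probabilities of component membership for one observation."""
    w = np.array([pi * normal_density(y, mu, s) for pi, mu, s in zip(props, mus, sigmas)])
    return w / w.sum()

# Hypothetical two-component bivariate normal mixture.
props = [0.3, 0.7]
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
sigmas = [np.array([[1.0, 0.3], [0.3, 2.0]]), np.array([[1.5, -0.4], [-0.4, 0.8]])]

C = np.array([[2.0, 1.0], [0.5, 3.0]])  # nonsingular matrix in (8)
a = np.array([1.0, -2.0])               # shift vector in (8)
y = np.array([1.0, 0.5])

tau_original = posteriors(y, props, mus, sigmas)
tau_transformed = posteriors(
    C @ y + a,
    props,
    [C @ m + a for m in mus],
    [C @ s @ C.T for s in sigmas],
)
```

Each component density is rescaled by the same factor $|\det C|^{-1}$, which cancels in the ratio defining the posterior probabilities.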



5 Some Recent Extensions for High-Dimensional Data 



The EMMIX-GENE program of McLachlan, Bean, and Peel (2002) is an exten- 
sion of the EMMIX program of McLachlan, Peel, Basford, and Adams (1999) for 
the normal mixture model-based clustering of a limited number of observations 
that may be of extremely high-dimensions. It was called EMMIX-GENE as it was 
designed specifically for problems in bioinformatics that require the clustering of a 
relatively small number of tissue samples containing the expression levels of possi- 
bly thousands of genes. But it is applicable to clustering problems outside the field 
of bioinformatics involving high-dimensional data. In situations where the number 
of variables p is large, it might not be practical to fit mixtures of factor analyzers 
to data on all the variables, as it would involve a considerable amount of compu- 
tation time. Thus initially some of the variables may have to be removed. Indeed, 
the simultaneous use of too many variables in the cluster analysis may serve only 
to create noise that masks the effect of a smaller number of variables. Also, the 
intent of the cluster analysis may not be to produce a clustering of the observations 
on the basis of all the available variables, but rather to discover and study different 
clusterings of the observations corresponding to different subsets of the variables. 

Therefore, the EMMIX-GENE procedure has two optional steps before the final
step of clustering the observations. The first step considers the selection of a subset 
of relevant variables from the available set of variables by screening the variables 
on an individual basis to eliminate those which are of little use in clustering the 
observations. The usefulness of a given variable to the clustering process can be 
assessed formally by a test of the null hypothesis that it has a single component 
normal distribution over the observations (McLachlan et al., 2002). A faster but 
ad hoc way is to make this decision on the basis, say, of the sample interquartile
range; if a variable has a distribution that is a mixture of normals, then its interquar- 
tile range will be greater than that for a single normal population. Even after this 
step has been completed, there may still remain too many variables. Thus there is a 
second step in EMMIX-GENE in which the retained variables are clustered (after 
standardization) into a number of groups on the basis of Euclidean distance so that 
variables with similar profiles are put into the same group. In general, care has to 
be taken with the scaling of variables before clustering of the observations, as the 
nature of the variables can be intrinsically different. Also, as noted above, the clus- 
tering of the observations via normal mixture models is invariant under changes in 
scale and location. The clustering of the observations can be carried out on the basis 
of the groups considered individually using some or all of the variables within a 
group or collectively. For the latter, we can replace each group by a representative
(a metavariable) such as the sample mean as in the EMMIX-GENE procedure. 

6 Factor Analysis Model for Dimension Reduction 

As remarked earlier, the $g$-component normal mixture model with unrestricted
component-covariance matrices is a highly parameterized model with $\frac{1}{2}p(p+1)$
parameters for each component-covariance matrix $\Sigma_i$ $(i = 1, \ldots, g)$. As discussed
above, Banfield and Raftery (1993) introduced a parameterization of the component-covariance
matrix $\Sigma_i$ based on a variant of the standard spectral decomposition of
$\Sigma_i$ $(i = 1, \ldots, g)$. However, if $p$ is large relative to the sample size $n$, it may
not be possible to use this decomposition to infer an appropriate model for the
component-covariance matrices. Even if it is possible, the results may not be reliable
due to potential problems with near-singular estimates of the component-covariance
matrices when $p$ is large relative to $n$.

A common approach to reducing the number of dimensions is to perform a
principal component analysis (PCA). But as is well known, projections of the feature
data $y_j$ onto the first few principal axes are not always useful in portraying
the group structure; see the example in McLachlan and Peel (2000, Sect. 8.2). A
global nonlinear approach can be obtained by postulating a factor-analytic model
for each component-covariance matrix of the full feature vector $Y_j$ (Hinton, Dayan,
& Revow, 1997; McLachlan & Peel, 2000; McLachlan, Peel, & Bean, 2003). This
leads to the mixture of factor analyzers (MFA) model given by

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i \phi(y_j; \mu_i, \Sigma_i), \tag{9}$$

where the $i$th component-covariance matrix $\Sigma_i$ has the form

$$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g), \tag{10}$$

and where $B_i$ is a $p \times q$ matrix of factor loadings and $D_i$ is a diagonal matrix
$(i = 1, \ldots, g)$.

This MFA approach with the factor-analytic representation (10) on $\Sigma_i$ is equivalent
to assuming that the distribution of the difference $Y_j - \mu_i$ can be modelled as

$$Y_j - \mu_i = B_i U_{ij} + e_{ij} \quad \text{with prob. } \pi_i \quad (i = 1, \ldots, g) \tag{11}$$

for $j = 1, \ldots, n$, where the (unobservable) factors $U_{i1}, \ldots, U_{in}$ are distributed
independently $N(0, I_q)$, independently of the $e_{ij}$, which are distributed independently
$N(0, D_i)$, where $D_i$ is a diagonal matrix $(i = 1, \ldots, g)$.

The parameter vector $\Psi$ now consists of the mixing proportions $\pi_i$ and the
elements of the $\mu_i$, the $B_i$, and the $D_i$. With this approach, the number of free
parameters is controlled through the dimension of the latent factor space. By working
in this reduced space, it allows a model for each component-covariance matrix
with complexity lying between that of the isotropic and full covariance structure
models without any restrictions on the covariance matrices. The mixture of factor
analyzers model can be fitted by using the alternating expectation-conditional
maximization (AECM) algorithm of Meng and van Dyk (1997).

A formal test for the number of factors can be undertaken using the likelihood
ratio $\lambda$, as regularity conditions (Rao, 1973) hold for this test conducted at a given
value for the number of components $g$. For the null hypothesis $H_0: q = q_0$ vs.
the alternative $H_1: q = q_0 + 1$, the statistic $-2 \log \lambda$ is asymptotically chi-squared
with $d = g(p - q_0)$ degrees of freedom. However, in situations where $n$ is not
large relative to the number of unknown parameters, we prefer the use of the BIC
criterion (Schwarz, 1978). Applied in this context, it means that twice the increase
in the log likelihood ($-2 \log \lambda$) has to be greater than $d \log n$ for the null hypothesis
to be rejected.
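The BIC-based decision rule just described can be written as a short function. This is a sketch only: the function name and the log likelihood values in the example are hypothetical, and the two maximized log likelihoods are assumed to come from fits with $q_0$ and $q_0 + 1$ factors at the same number of components $g$.

```python
import math

def prefer_more_factors(loglik_q0, loglik_q1, g, p, q0, n):
    """Return True if H0: q = q0 is rejected in favour of q = q0 + 1 by the BIC rule."""
    d = g * (p - q0)                      # degrees of freedom quoted in the text
    stat = 2.0 * (loglik_q1 - loglik_q0)  # twice the increase in log likelihood
    return stat > d * math.log(n)

# Hypothetical maximized log likelihoods for g = 2, p = 10, q0 = 2, n = 100.
reject_large_gain = prefer_more_factors(-1500.0, -1450.0, g=2, p=10, q0=2, n=100)
reject_small_gain = prefer_more_factors(-1500.0, -1480.0, g=2, p=10, q0=2, n=100)
```

Here $d \log n = 16 \log 100 \approx 73.7$, so an increase of 50 in the log likelihood (statistic 100) leads to rejection, whereas an increase of 20 (statistic 40) does not.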

The mixture of factor analyzers model is sensitive to outliers since it uses normal 
errors and factors. Recently, McLachlan, Bean, and Ben-Tovim Jones (2007) have 
considered the use of mixtures of t analyzers in an attempt to make the model less 
sensitive to outliers. In some other recent work, Montanari and Viroli (2007) have 
considered the use of mixtures of factor analyzers with covariates. 

As $\frac{1}{2}q(q-1)$ constraints are needed for $B_i$ to be uniquely defined, the number
of free parameters in (10) is

$$pq + p - \tfrac{1}{2}q(q-1). \tag{12}$$

Thus with this representation (10), the reduction in the number of parameters for
$\Sigma_i$ is

$$r = \tfrac{1}{2}p(p+1) - pq - p + \tfrac{1}{2}q(q-1) = \tfrac{1}{2}\{(p-q)^2 - (p+q)\}, \tag{13}$$

assuming that $q$ is chosen sufficiently smaller than $p$ so that this difference is
positive. The total number of parameters is

$$d_1 = (g-1) + 2gp + g\{pq - \tfrac{1}{2}q(q-1)\}. \tag{14}$$



Even with this MFA approach, the number of parameters still might not be 
manageable, particularly if the number of dimensions p is large and/or the num- 
ber of components (clusters) g is not small. In the sequel, we focus on how the 
MFA approach can be modified to provide a greater reduction in the number of 
parameters. 
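The parameter counts (12), (13) and (14) are easy to tabulate for given $p$, $q$ and $g$. The following sketch uses hypothetical function names and an illustrative choice of dimensions.

```python
def cov_params_unrestricted(p):
    """Distinct elements of an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def cov_params_mfa(p, q):
    """Free parameters in Sigma_i = B_i B_i^T + D_i, equation (12)."""
    return p * q + p - q * (q - 1) // 2

def reduction(p, q):
    """Reduction r in the number of parameters for Sigma_i, equation (13)."""
    return ((p - q) ** 2 - (p + q)) // 2

def total_params_mfa(g, p, q):
    """Total number of parameters d_1 in the MFA model, equation (14)."""
    return (g - 1) + 2 * g * p + g * (p * q - q * (q - 1) // 2)

# Illustrative dimensions: p = 50 variables, q = 3 factors, g = 4 components.
p, q, g = 50, 3, 4
unrestricted = cov_params_unrestricted(p)   # 1275
factor_analytic = cov_params_mfa(p, q)      # 197
saved = reduction(p, q)                     # 1078
d1 = total_params_mfa(g, p, q)              # 991
```

Even with the saving of 1078 parameters per component in this example, $d_1$ still grows with $g$ through the terms $2gp$ and $gpq$.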



7 Mixtures of Common Factor Analyzers 



Baek and McLachlan (2008) have proposed the Mixtures of Common Factor Analyzers
(MCFA) approach whereby the distribution of $Y_j$ is modelled as

$$Y_j = A U_{ij} + e_{ij} \quad \text{with prob. } \pi_i \quad (i = 1, \ldots, g) \tag{15}$$

for $j = 1, \ldots, n$, where the (unobservable) factors $U_{i1}, \ldots, U_{in}$ are distributed
independently $N(\xi_i, \Omega_i)$, independently of the $e_{ij}$, which are distributed independently
$N(0, D)$, where $D$ is a diagonal matrix $(i = 1, \ldots, g)$. Here $A$ is a $p \times q$
matrix of factor loadings. The representation (15) is not unique, as it still has the
same form if $A$ were to be postmultiplied by any nonsingular matrix. Hence the
number of free parameters in $A$ is

$$pq - q^2. \tag{16}$$
To see that the MCFA model as specified by (15) is a special case of the MFA
approach as specified by (11), we note that we can rewrite (15) as

$$\begin{aligned}
Y_j &= A U_{ij} + e_{ij} \\
    &= A \xi_i + A (U_{ij} - \xi_i) + e_{ij} \\
    &= \mu_i + A K_i K_i^{-1} (U_{ij} - \xi_i) + e_{ij} \\
    &= \mu_i + B_i U_{ij}^* + e_{ij},
\end{aligned} \tag{17}$$

where

$$\mu_i = A \xi_i, \tag{18}$$

$$B_i = A K_i, \tag{19}$$

$$U_{ij}^* = K_i^{-1} (U_{ij} - \xi_i), \tag{20}$$

and where the $U_{ij}^*$ are distributed independently $N(0, I_q)$. The covariance matrix
of $U_{ij}^*$ is equal to $I_q$, since $K_i$ can be chosen so that

$$K_i^{-1} \Omega_i (K_i^{-1})^T = I_q \quad (i = 1, \ldots, g). \tag{21}$$

On comparing (17) with (11), it can be seen that the MCFA model is a special
case of the MFA model with the additional restrictions that

$$\mu_i = A \xi_i \quad (i = 1, \ldots, g), \tag{22}$$

$$B_i = A K_i \quad (i = 1, \ldots, g), \tag{23}$$

$$D_i = D \quad (i = 1, \ldots, g). \tag{24}$$

The latter restriction of equal diagonal covariance matrices for the component-specific
error terms ($D_i = D$) is sometimes imposed with applications of the
MFA approach to avoid potential singularities with small clusters (McLachlan et al.,
2003). It follows from (23) that the $i$th component-covariance matrix $\Sigma_i$ has the
form

$$\Sigma_i = A \Omega_i A^T + D \quad (i = 1, \ldots, g). \tag{25}$$

Concerning the restriction (23) that the matrix of factor loadings is equal
to $A K_i$ for each component, it can be viewed as adopting common factor loadings
before the use of the transformation $K_i$ to transform the factors so that they
have unit variances and zero covariances. Hence this is why Baek and McLachlan
(2008) called this approach mixtures of common factor analyzers. It is also different
to the MFA approach in that it considers the factor-analytic representation of the
observations $Y_j$ directly, rather than the error terms $Y_j - \mu_i$.
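The relations (21), (23) and (25) can be verified numerically. The sketch below assumes NumPy, takes $K_i$ to be the Cholesky factor of $\Omega_i$ (one possible choice satisfying (21); the text does not prescribe a particular one) and uses randomly generated, purely illustrative values for $A$, $D$ and $\Omega_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 2

A = rng.normal(size=(p, q))                   # common factor loadings (hypothetical values)
D = np.diag(rng.uniform(0.5, 1.5, size=p))    # common diagonal error covariance
M = rng.normal(size=(q, q))
Omega = M @ M.T + np.eye(q)                   # positive definite factor covariance

K = np.linalg.cholesky(Omega)                 # lower triangular, K K^T = Omega
K_inv = np.linalg.inv(K)

whitened = K_inv @ Omega @ K_inv.T            # left-hand side of (21)
B = A @ K                                     # component loadings, (23)
sigma_mcfa = A @ Omega @ A.T + D              # component covariance, (25)
sigma_mfa = B @ B.T + D                       # MFA form (10) with D_i = D
```

Since $B_i B_i^T = A K_i K_i^T A^T = A \Omega_i A^T$, the two covariance matrices coincide, which is the step from (23) to (25).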

With the restrictions (22) and (25) on the component mean $\mu_i$ and covariance
matrix $\Sigma_i$, respectively, the total number of free parameters is

$$d_2 = (g-1) + p + q(p+g) + \tfrac{1}{2} g q (q+1) - q^2. \tag{26}$$

As the MFA approach allows a more general representation of the component- 
covariance matrices and places no restrictions on the component means it is in this 
sense preferable to the MCFA approach if its application is feasible given the values 
of p and g. If the dimension p and/or the number of components g is too large, then 
the MCFA provides a more feasible approach at the expense of more distributional 
restrictions on the data. In empirical results some of which are to be reported in the 
sequel we have found the performance of the MCFA approach is usually at least 
comparable to the MFA approach for data sets to which the latter is practically 
feasible. The MCFA approach also has the advantage in that the latent factors in its 
formulation are allowed to have different means and covariance matrices and are 
not white noise as with the formulation of the MFA approach. Thus the (estimated)
posterior means of the factors corresponding to the observed data can be used to 
portray the latter in low-dimensional spaces. 
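To give a feel for the difference in model size, the totals $d_1$ in (14) and $d_2$ in (26) can be compared for the same $p$ and $q$ as the number of components $g$ grows. The function names and dimensions below are illustrative assumptions.

```python
def total_params_mfa(g, p, q):
    """Total number of parameters d_1 in the MFA model, equation (14)."""
    return (g - 1) + 2 * g * p + g * (p * q - q * (q - 1) // 2)

def total_params_mcfa(g, p, q):
    """Total number of parameters d_2 in the MCFA model, equation (26)."""
    return (g - 1) + p + q * (p + g) + g * q * (q + 1) // 2 - q * q

# Illustrative dimensions: p = 50 variables, q = 3 factors.
p, q = 50, 3
counts = {g: (total_params_mfa(g, p, q), total_params_mcfa(g, p, q)) for g in (2, 4, 8)}
```

For $g = 4$ this gives 991 parameters for MFA against 230 for MCFA. In $d_2$ the terms involving $p$ do not depend on $g$, so each additional component adds only $1 + q + \frac{1}{2}q(q+1)$ parameters.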

The MCFA approach is similar in form to the approach proposed by Yoshida,
Higuchi, and Imoto (2004) and Yoshida, Higuchi, Imoto, and Miyano (2006) who
also imposed the additional restrictions that the common diagonal covariance matrix
$D$ of the error terms is spherical,

$$D = \sigma^2 I_p, \tag{27}$$

and that the component-covariance matrices of the factors are diagonal. We shall call
this approach MCUFSA (mixtures of common uncorrelated factor spherical-error
analyzers). The total number of parameters with this approach is

$$d_3 = (g-1) + pq + 1 + 2gq - \tfrac{1}{2}q(q+1). \tag{28}$$



8 Fitting of Factor-Analytic Models

The fitting of mixtures of factor analyzers as with the MFA approach has been considered
in McLachlan et al. (2003), using a variant of the EM algorithm known as
the alternating expectation-conditional maximization algorithm (AECM). With the
MCFA approach, we have to fit the same mixture model of factor analyzers but with
the additional restrictions (22) and (25) on the component means $\mu_i$ and covariance
matrices $\Sigma_i$. The implementation of the EM algorithm for this model was developed
in Baek and McLachlan (2008). In the EM framework, the component label $z_j$
associated with the observation $y_j$ is introduced as missing data, where $z_{ij} = (z_j)_i$
is one or zero according as $y_j$ belongs or does not belong to the $i$th component of
the mixture $(i = 1, \ldots, g;\ j = 1, \ldots, n)$. The unobservable factors $U_{ij}$ are also
introduced as missing data in the EM framework.

As part of the E-step, we require the conditional expectation of the component
labels $z_{ij}$ $(i = 1, \ldots, g)$ given the observed data point $y_j$ $(j = 1, \ldots, n)$. It
follows that

$$E_{\Psi}\{Z_{ij} \mid y_j\} = \mathrm{pr}_{\Psi}\{Z_{ij} = 1 \mid y_j\} = \tau_i(y_j; \Psi) \quad (i = 1, \ldots, g;\ j = 1, \ldots, n), \tag{29}$$

where $\tau_i(y_j; \Psi)$ is the posterior probability that $y_j$ belongs to the $i$th component
of the mixture. From (4), it can be expressed under the MCFA model as

$$\tau_i(y_j; \Psi) = \frac{\pi_i \phi(y_j; A\xi_i, A\Omega_i A^T + D)}{\sum_{h=1}^{g} \pi_h \phi(y_j; A\xi_h, A\Omega_h A^T + D)} \tag{30}$$

for $i = 1, \ldots, g;\ j = 1, \ldots, n$.
We also require the conditional distribution of the unobservable (latent) factors
$U_{ij}$ given the observed data $y_j$ $(j = 1, \ldots, n)$. The conditional distribution of
$U_{ij}$ given $y_j$ and its membership of the $i$th component of the mixture (that is, $z_{ij} = 1$)
is multivariate normal,

$$U_{ij} \mid y_j, z_{ij} = 1 \sim N(\xi_{ij}, \Omega_{iy}), \tag{31}$$

where

$$\xi_{ij} = \xi_i + \gamma_i^T (y_j - A\xi_i) \tag{32}$$

and

$$\Omega_{iy} = (I_q - \gamma_i^T A)\, \Omega_i, \tag{33}$$

and where

$$\gamma_i = (A \Omega_i A^T + D)^{-1} A \Omega_i. \tag{34}$$

We can portray the observed data $y_j$ in $q$-dimensional space by plotting the
corresponding values of the $\hat{u}_{ij}$, which are estimated conditional expectations of the
factors $U_{ij}$, corresponding to the observed data points $y_j$. From (31) and (32),

$$E(U_{ij} \mid y_j, z_{ij} = 1) = \xi_{ij} = \xi_i + \gamma_i^T (y_j - A\xi_i). \tag{35}$$

We let $\hat{u}_{ij}$ denote the value of the right-hand side of (35) evaluated at the maximum
likelihood estimates of $\xi_i$, $\gamma_i$, and $A$. We can define the estimated value $\hat{u}_j$ of the
$j$th factor corresponding to $y_j$ as

$$\hat{u}_j = \sum_{i=1}^{g} \tau_i(y_j; \hat{\Psi})\, \hat{u}_{ij} \quad (j = 1, \ldots, n). \tag{36}$$

An alternative estimate of the posterior expectation of the factor corresponding to
the $j$th observation $y_j$ is defined by replacing $\tau_i(y_j; \hat{\Psi})$ by $\hat{z}_{ij}$ in (36).
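Given fitted values of the MCFA parameters, the factor scores in (36) follow directly from (30), (34) and (35). The sketch below assumes NumPy; the parameter values stand in for maximum likelihood estimates and are purely hypothetical, as are the function names. It does not perform the EM fitting itself.

```python
import numpy as np

def normal_density(y, mu, sigma):
    """Multivariate normal density phi(y; mu, Sigma), as in (7)."""
    p = mu.shape[0]
    diff = y - mu
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** p * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def factor_scores(Y, props, A, D, xis, omegas):
    """Posterior probabilities (30) and estimated factor scores (36) for the rows of Y."""
    n, g, q = Y.shape[0], len(props), A.shape[1]
    tau = np.zeros((n, g))
    u = np.zeros((g, n, q))
    for i in range(g):
        mu_i = A @ xis[i]                                   # component mean, (22)
        sigma_i = A @ omegas[i] @ A.T + D                   # component covariance, (25)
        gamma_i = np.linalg.solve(sigma_i, A @ omegas[i])   # p x q matrix, (34)
        u[i] = xis[i] + (Y - mu_i) @ gamma_i                # conditional means, (35)
        tau[:, i] = [props[i] * normal_density(y, mu_i, sigma_i) for y in Y]
    tau /= tau.sum(axis=1, keepdims=True)                   # normalize as in (30)
    scores = np.einsum("ji,ijk->jk", tau, u)                # weighted average, (36)
    return scores, tau

# Hypothetical parameter values standing in for the ML estimates.
rng = np.random.default_rng(0)
p, q = 4, 2
A = rng.normal(size=(p, q))
D = np.diag([0.5, 0.8, 1.0, 0.6])
props = [0.4, 0.6]
xis = [np.array([0.0, 0.0]), np.array([2.0, -1.0])]
omegas = [np.eye(q), np.array([[1.0, 0.3], [0.3, 0.5]])]
Y = rng.normal(size=(6, p))

scores, tau = factor_scores(Y, props, A, D, xis, omegas)
```

Each row of `scores` is a point in $q$-dimensional space that can be plotted to display the data. Replacing `tau` by its row-wise 0/1 indicator of the largest entry gives the alternative estimate based on $\hat{z}_{ij}$.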
Clustering and Dimensionality Reduction
to Discover Interesting Patterns in Binary Data



Francesco Palumbo and Alfonso Iodice D’Enza



Abstract The attention towards binary data coding increased consistently in the 
last decade due to several reasons. The analysis of binary data characterizes several 
fields of application, such as market basket analysis, DNA microarray data, image 
mining, text mining and web-clickstream mining. The paper illustrates two differ- 
ent approaches exploiting a profitable combination of clustering and dimensionality 
reduction for the identification of non-trivial association structures in binary data. 
An application in the Association Rules framework supports the theory with the 
empirical evidence. 

Keywords Association • Binary data • Cluster analysis • Dimensionality reduction.



1 Introduction 



The relevance of binary data analysis was underlined by Sir D.R. Cox in his paper 
“The Analysis of Multivariate Binary Data” (Cox, 1972). Cox pointed out that: 

It is fairly common to have multivariated data in which the individual variates take one of 
just two possible values that can be coded as 0 and 1. [. . . ] we shall concentrate on the 
genuinely situation in which there are several, and indeed possibly many, binary response 
variates. We then have to study the association between these variables and not just the 
dependence of one variate on others. 

Cox’s intuition remains highly relevant today: attention to binary data coding has
grown steadily over the last decade. This tendency has several causes, some of
which go beyond what Cox foresaw. Binary coding is the most basic form in which
computers store information. In addition, binary data represent the most
straightforward coding for automatically collecting and storing information about
the phenomena under study. From a computational point of view, binary coding of
data ensures fast and flexible processing with low memory storage requirements.
A binary data set consists of a collection of binary sequences, each arranged in
a $p$-dimensional binary row vector, $p$ being the number of attributes considered.
The range of applications involving binary data structures is wide: from
Market Basket Analysis (MBA) to Web Mining and Microarray Data Analysis.
Text Mining and Image Analysis are further fields of application dealing with
binary data structures. In all of these contexts, the number of binary sequences is
usually very large, if not huge. When dealing with very large samples, it is worth
noticing that, if $T_n$ is a consistent estimator of $\theta$, any deviation of $T_n$ from the
null hypothesis $H_0: \theta = \theta^*$ will turn out to be significant with probability
tending to 1 as the sample size $n \rightarrow \infty$.

The general aim of this paper is the study of association between binary attributes
in order to identify homogeneous sets of data, when both $n$ and $p$ tend to be large.
In the binary data analysis framework, this proposal introduces two approaches 
exploiting a profitable combination of clustering and dimensionality reduction for 
the identification of non-trivial association structures. 

The article consists of this introduction and of the following four further sec- 
tions: Sect. 2 contains basic notation and definitions of the data structures; Sect. 3 
introduces the clustering problem when dealing with binary data; Sect. 4 illustrates 
the proposed strategies; finally, the last section illustrates the approach capabilities, 
through an application on a real data-set. The application refers to Association rules 
(AR) mining (Agrawal, Imielinski, & Swami, 1993) framework. 



2 Basic Notation and Definitions 

In this section data structures and the corresponding notation are introduced: 

- $Z = [z_{ij}]$ $(n \times p)$ binary data matrix; $z_{ij}$ indicates the presence of attribute $j$ in
sequence $i$.

- $n$ Number of binary sequences.

- $p$ Number of attributes.

- $K$ Number of groups of binary sequences.

- $H$ Number of groups of attributes.

In particular, $K$ and $H$ define a partition on rows and on columns of $Z$, respectively.

The notion of Association Rule (AR) was first introduced by Agrawal et al.
(1993) in the Data Mining context to study the association in large binary datasets. The
AR formalization is based on the definition of two indexes, support and confidence,
and it turns out to be particularly suitable to illustrate the proposal. Let $z_j$ and $z_{j^*}$
be two $n \times 1$ column vectors of $Z$; then the association rule $j \rightarrow j^*$ provides
information on the co-occurrence of $j$ and $j^*$. The support indicates the relative
frequency of binary sequences containing both $j$ and $j^*$. Supports of attribute pairs
can be arranged in the following square symmetric matrix

$$S = n^{-1} Z^T Z. \tag{1}$$

The confidence indicates the proportion of sequences containing $j$ among those that
contain $j^*$. Let us indicate with $C$ the $p \times p$ matrix having as general term the pair
confidence measures

$$C = S D^{-1}, \tag{2}$$

where $D$ is a diagonal matrix such that the general term $d_{jj} = s_{j\cdot} = s_{jj}$, with
$j, j' = 1, \ldots, p$. General terms $s_{jj'}$ of $S$ and $c_{jj'}$ of $C$ can also be conceived in
terms of empirical probabilities: the former is the joint probability $P(j \cap j')$, and
the latter the conditional probability $P(j \mid j')$ of $j$ and $j'$.

Finally, $p_h$, $F$ and $X$ are defined as:

- $p_h$ vector indicating the attributes in the $h$th group

- $F$ $(K \times p)$ matrix with general element $f_{kj}$ being the frequency of the $j$th
attribute in the $k$th group

- $X$ $(n \times K)$ matrix that assigns each sequence to one of the $K$ groups
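The support matrix (1) and the confidence matrix (2) can be computed directly from a binary data matrix. The following is a minimal sketch with plain Python lists; the small matrix `Z` and the function names are hypothetical, and every attribute is assumed to occur at least once so that the diagonal of $D$ is nonzero.

```python
def support_matrix(Z):
    """S = Z^T Z / n: relative frequency of sequences containing both attributes."""
    n, p = len(Z), len(Z[0])
    return [[sum(row[j] * row[k] for row in Z) / n for k in range(p)] for j in range(p)]

def confidence_matrix(S):
    """C = S D^{-1} with d_kk = s_kk, so that c_jk = s_jk / s_kk = P(j | k)."""
    p = len(S)
    return [[S[j][k] / S[k][k] for k in range(p)] for j in range(p)]

# Hypothetical data: n = 4 binary sequences on p = 3 attributes.
Z = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
S = support_matrix(Z)
C = confidence_matrix(S)
```

In this example attribute 0 is present in every sequence that contains attribute 2, so `C[0][2]` equals 1, while `C[2][0]` equals 2/3. The support matrix is symmetric, the confidence matrix in general is not.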



3 Cluster Analysis of Binary Data 

Clustering is a crucial task in data mining, and the application of cluster analysis
to binary data has received a great deal of attention in the literature. In the AR mining
framework, several contributions have shown the effectiveness of clustering techniques
within strategies for the identification of association patterns. In particular,
Plasse, Niang, Saporta, Villeminot, and Leblond (2007) proposed an attribute-wise
cluster analysis in order to identify and observe patterns of rare attributes, which
can be missed when frequent pattern counting algorithms are run on the whole binary
database. In the same direction, even if with a different aim, Iodice D’Enza, Palumbo,
and Greenacre (2007) propose a row-wise cluster procedure to identify groups
of homogeneous binary sequences: the entire data structure is then partitioned into
local subsets of binary sequences. The comparison of local and global association
structures leads to the identification of interesting sets of attributes.

The outcome of any clustering procedure crucially depends on the assumed
distance: Sect. 3.1 illustrates some of the most used metric measures for
binary data, and it shows their relation with the support and confidence measures.



3.1 Dissimilarity Measures for Binary Data 

Let $\tilde{Z}$ be the disjunctive coded table of Z, having $n$ rows and $2 \times p$ columns. Taking
into account two general attributes $j$ and $j^*$, the product $\tilde{Z}_j^{T} \tilde{Z}_{j^*}$ determines the
following $2 \times 2$ matrix:



$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \qquad (3)$$

Table 1 Dissimilarity indexes for binary data

Index                    Formula
Simple matching coef.    $(b + c) / (a + b + c + d)$
Jaccard coef.            $(b + c) / (a + b + c)$
Russel & Rao coef.       $(b + c + d) / (a + b + c + d)$
Euclidean distance       $\sqrt{b + c}$

where a indicates the number of co-presences of attributes; b and c correspond to the
non-matchings; d indicates the number of co-absences. The quantities introduced
in (3) can be used to express support and confidence, respectively

$$s_{jj^*} = \frac{a}{a + b + c + d}, \qquad c_{jj^*} = \frac{a}{a + c} = P(j^* \mid j). \qquad (4)$$



The set $\{a, b, c, d\}$ makes it possible to define most of the dissimilarity/similarity
measures for binary data: Table 1 shows some of the widely used metric measures
for binary data. Formulating the distance measures for binary variables in terms of the set
$\{a, b, c, d\}$ makes evident the connection between the distances and the association
measures in (4): indeed, it is worth noticing that the Jaccard coefficient is defined as
$a(a + b + c)^{-1}$ and its corresponding metric measure is indicated in Table 1. The use
of Euclidean or Jaccard metrics to cluster binary sequences is then consistent with
the support/confidence framework.
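
The indexes of Table 1 can be computed directly from the four counts. The sketch below is only an illustration: the function names are hypothetical, and returning 0 for the Jaccard index when both vectors are empty is a convention assumed here, not something stated in the text.

```python
from math import sqrt


def contingency_counts(x, y):
    """Return (a, b, c, d) for two binary vectors of equal length."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # co-presences
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # non-matching
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # non-matching
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # co-absences
    return a, b, c, d


def dissimilarities(x, y):
    """Dissimilarity indexes of Table 1, computed from (a, b, c, d)."""
    a, b, c, d = contingency_counts(x, y)
    n = a + b + c + d
    return {
        "simple_matching": (b + c) / n,
        # Assumed convention: two all-zero vectors have Jaccard dissimilarity 0.
        "jaccard": (b + c) / (a + b + c) if (a + b + c) > 0 else 0.0,
        "russel_rao": (b + c + d) / n,
        "euclidean": sqrt(b + c),
    }


x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1]
result = dissimilarities(x, y)
```

Since $(b + c + d)/(a + b + c + d) = 1 - a/(a + b + c + d)$, the Russel & Rao index is one minus the support of the pair, which makes the link between Table 1 and the association measures explicit.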



4 Strategies for Patterns Identification 

The identification of patterns of association in binary data is based on a profitable
combination of clustering and dimensionality reduction. Such a combination has
already been proposed in the case of continuous variables and yields good results
under general hypotheses; see Vichi and Kiers (2001).

As pointed out in the literature, dimensionality reduction, that is a quantification 
of qualitative attributes, can be performed column-wise and row-wise depending 
on the aim of the analysis. In particular, column-wise quantification of binary 
attributes is performed via Multiple Correspondence Analysis (MCA) (Greenacre, 
2007) and it aims to detect multiple association between attributes, removing noise 
and redundancies in data. 

In the row-wise quantification, the presence of a latent cluster structure characterizing
binary sequences is assumed: attributes are then quantified taking into
account groups of homogeneous sequences. Row-wise quantification is obtained
via Non-Symmetric Correspondence Analysis (NSCA) (Lauro & D’Ambra, 1984).
We then consider two alternative approaches based respectively on (1) partition
clustering and Multiple Correspondence Analysis (Greenacre, 2007); (2) agglomerative
clustering and Non-Symmetric Correspondence Analysis (Lauro & D’Ambra,
1984).



4.1 Column-Wise Quantification of Binary Attributes 

The use of factorial methods to quantify qualitative data was introduced by Saporta
(1975). The quantification consists of a projection of the starting variables onto an
orthogonal subspace: the number of chosen dimensions determines the degree of synthesis
provided by the quantification. The advantages of using MCA to study associations in
binary data are then the removal of noise and redundancies in the data as well as a
synthetic representation of the multiple associations characterizing the attributes.

MCA is a Correspondence Analysis of a Burt table given by $B = Z^{T} Z$. The
correspondence matrix is $P = B / \mathrm{total}$ (B divided by its grand total), with
row/column margins denoted by $r$.
A reduced rank approximation of P is given by the SVD of its centered version Q,
with general element

$$q_{ij} = \frac{p_{ij} - r_i r_j}{\sqrt{r_i r_j}}, \qquad i, j = 1, \ldots, p. \qquad (5)$$

The solution is obtained through a singular value decomposition $Q = U \Lambda U^{T}$, with
U and $\Lambda$ the eigenvector and eigenvalue matrices. The principal co-ordinate of
the $i$th row point on the $s$th dimension is obtained through $g_{is} = a_{is} \lambda_s$, with $a_{is}$
being the corresponding standard co-ordinate, that is $a_{is} = u_{is} / \sqrt{r_i}$, $\lambda_s$ being the $s$th
eigenvalue and $u_{is}$ being the $i$th element of the corresponding eigenvector.

As an improvement to DISQUAL-like quantification of binary variables, Iodice
D’Enza and Palumbo (2007) propose to take into account the clustering structures
underlying the binary sequences. In particular, the authors suggest using a partition
algorithm (k-means, MacQueen, 1967) to determine K groups of homogeneous
sequences: the local structures so determined are then taken into account in the
quantification. In other words, the quantification of the binary variables takes into account
both the whole data association structure (global) and the within-group association
structures (local). Let $Z_k$ be the indicator matrix of the binary sequences belonging to
group $k$ ($k = 1, \ldots, K$); attributes in $Z_k$ are projected as supplementary information
on the factorial subspace determined through the MCA of Z. In particular, for
each of the K groups, the supplementary coordinate of the $j$th variable on the $s$th
dimension of the subspace of approximation is



$$g_{js,k} = \sum_{i=1}^{n} \frac{z_{ij,k}}{z_{.j,k}} \, a_{is}, \qquad (6)$$

with $z_{.j,k}$ being the support of attribute $j$ within group $k$ (Greenacre & Nenadic,
2005).

We conclude this section by showing that the MCA of Z corresponds to a PCA on
the standardized binary matrix and that it maximizes the squared Euclidean distance
between attributes. In particular, MCA aims at finding the best orthogonal subspace
that maximizes the sum of the following distances among attributes (columns of Z):

$$d^2(z_{.j}, z_{.j'}) = \sum_{i=1}^{n} \left( \frac{z_{ij}}{z_{.j}} - \frac{z_{ij'}}{z_{.j'}} \right)^2 = \frac{a + b}{(a + b)^2} - \frac{2a}{(a + b)(a + c)} + \frac{a + c}{(a + c)^2} = \frac{b + c}{(a + b)(a + c)}$$

for all the possible pairs $\{j, j'\}$, $j \neq j'$ and $j, j' = 1, \ldots, p$. Recall that
$\sum_{i=1}^{n} z_{ij}^2 = z_{.j}$, since $z_{ij} \in \{0, 1\}$ $\forall i, j$, and that
$\sum_{i=1}^{n} z_{ij}^2 / z_{.j}^2 = (a + b) / (a + b)^2$,
$-2 \sum_{i=1}^{n} z_{ij} z_{ij'} / (z_{.j} z_{.j'}) = -2a / ((a + b)(a + c))$ and
$\sum_{i=1}^{n} z_{ij'}^2 / z_{.j'}^2 = (a + c) / (a + c)^2$. This result shows how
the distance between attributes can be re-expressed in terms of the squared Euclidean
distance (which is $(\sqrt{b + c})^2$) standardized by the independence condition
$(a + b)(a + c)$.
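
The identity behind this distance can be checked numerically. The sketch below, with hypothetical helper names and toy vectors, compares the squared Euclidean distance between two column profiles $z_{ij} / z_{.j}$ with the closed form $(b + c) / ((a + b)(a + c))$, assuming both attributes occur at least once.

```python
def profile_distance_sq(zj, zk):
    """Squared Euclidean distance between the profiles zj / sum(zj) and zk / sum(zk)."""
    tj, tk = sum(zj), sum(zk)
    return sum((u / tj - v / tk) ** 2 for u, v in zip(zj, zk))


def closed_form(zj, zk):
    """The same distance written as (b + c) / ((a + b)(a + c))."""
    a = sum(u * v for u, v in zip(zj, zk))  # co-presences
    b = sum(zj) - a                         # present in j only
    c = sum(zk) - a                         # present in the other attribute only
    return (b + c) / ((a + b) * (a + c))


zj = [1, 1, 1, 0]
zk = [1, 0, 0, 0]
direct = profile_distance_sq(zj, zk)
formula = closed_form(zj, zk)
```

Here $a = 1$, $b = 2$ and $c = 0$, so both expressions evaluate to $2/3$.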



4.2 Row-Wise Quantification of Binary Attributes 

Let $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ be a random vector containing the probabilities for a binary
sequence to belong to the $k$th group, where $k = 1, \ldots, K$. The aim is to partition
sequences in K groups such that:

$$\max \sum_{j} \left( E[P(X_k \mid Z_j)] - E[P(X_k)] \right), \qquad (7)$$

where $P(X_k)$ is the approximation of $\pi_k$ and $P(X_k \mid Z_j)$ the probability that a
sequence belongs to group $k$ given that it contains the $j$th attribute.

Maximizing expression (7) with respect to all the K groups leads to

$$\sum_{k=1}^{K} \left( \sum_{j=1}^{p} f_{kj} P(X_k \mid Z_j) - f_{k.} P(X_k) \right),$$

that is to estimate $\pi$ given Z.

The target function (7) corresponds to the solution of Lauro and D’Ambra’s
Non-Symmetrical Correspondence Analysis model: in particular, we refer to the
NSCA formalization of the problem introduced by Palumbo and Verde (1996). The
expectations in (7) can be re-expressed in terms of attribute frequencies as follows:

$$E[P(X_k \mid Z_j)] = \sum_{j=1}^{p} \frac{f_{kj}^2}{f_{.j}} \quad \text{and} \quad E[P(X_k)] = \frac{1}{n} \left( \sum_{j=1}^{p} f_{kj} \right)^2,$$



where fkj is the general element of matrix F. Furthermore, as pointed out by Lauro 
and Balbi (1999), the following relation holds 



$$\frac{1}{n} \sum_{k=1}^{K} [\ldots] \qquad (9)$$



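The left side quantity of expression (9) according to an algebraic formalization
corresponds to

$$\frac{1}{n} \mathrm{tr} \left[ X^{T} \left( Z \Delta^{-1} Z^{T} - \frac{1}{n} \mathbf{1} \mathbf{1}^{T} \right) X \right], \qquad (10)$$

where $\Delta = \mathrm{diag}(Z^{T} Z)$ and $\mathbf{1}$ is an $n$-dimensional vector of ones.
The target function is then

$$\frac{1}{n} X^{T} \left( Z \Delta^{-1} Z^{T} - \frac{1}{n} \mathbf{1} \mathbf{1}^{T} \right) X U = U \Lambda, \qquad (11)$$

that is, to compute eigenvalues and eigenvectors, stored in the diagonal matrix $\Lambda$ and
in the matrix U, respectively.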

However, in expression (11), matrices X, $\Lambda$ and U are unknown, thus a direct
solution is not possible. An alternating procedure is required, consisting of the
following steps:

• Step 0: pseudo-random generation of matrix X.

• Step 1: singular value decomposition of the matrix resulting from (11), obtaining
the matrix $\Psi$, such that $\Psi = \left( Z \Delta^{-1} Z^{T} - \frac{1}{n} \mathbf{1} \mathbf{1}^{T} \right) X U \Lambda^{1/2}$.

• Step 2: matrix X is updated according to a squared Euclidean distance based
clustering algorithm on the quantified sequences ($\Psi$ matrix).

Steps 1 and 2 are iterated until convergence: if the increase of the quantity in (10)
is not significantly greater than zero from one iteration to the next, the
procedure stops. This stopping rule only depends on the stability of the solution.
Note that the clustering phase (step 2) is performed by running the k-means algorithm
several times, with different starting points, on the quantified sequences:
this provides stability to the obtained result and, as a consequence, to the output of
the whole procedure. The overall computational effort of the procedure is
acceptable, as the clustering algorithm runs over low-dimensional data.

As for row-wise quantifications, the synthesis of binary attributes is obtained by
$\Phi = Z^{T} X U \Lambda^{1/2}$.

An agglomerative clustering is performed on $\Phi^*$, which is the matrix of
quantifications with a reduced number of dimensions. A reduced number of dimensions
of the input matrix $\Phi^*$ crucially eases the computation of the agglomerative procedure.
The output of the agglomerative clustering is a hierarchy: the user-defined cutoff
point determines H sets of homogeneous attributes. The dendrogram representation
of the hierarchy helps the user to choose the size of the H sets of attributes. As
a result of the whole procedure the starting binary matrix Z is partitioned into $K \times H$
high-association blocks.



5 Empirical Evidence from a Real Data-Set 

In this section we use the Epub data set, which is distributed with the “arules” package
for association rule mining in the R statistical environment. The Epub
data set contains the download history of documents from the electronic publication
platform of the Vienna University of Economics and Business Administration. The starting
data consist of 465 binary attributes and 3,975 sequences (sessions). Attributes
with fewer than five occurrences have been discarded; the filtered data refer to 1,024
sequences with respect to 283 publications. For lack of space we report just some of the
graphical output resulting from the application of the row-wise and column-wise strategies.
The factorial map representation of attributes resulting from MCA on the whole data
is in the left side of Fig. 1. The representation of the global association structure is
quite difficult to interpret due to the large number of attributes to display. The right
side of Fig. 1 shows an example of local association structure representation
characterizing one of the K groups of sequences determined via k-means (which in
Market Basket Analysis, MBA, may correspond to niches of similar customers). When
the data matrix is sparse, each group of sequences is described by a (small) subset of
the considered attributes, which increases the effectiveness of the factorial
representation. Furthermore, the agglomerative clustering of the low-dimensional
quantifications defines H patterns of homogeneous attributes: attributes in the same
pattern are represented in the same color. In addition, using a brushing technique it is
possible to monitor the association structure of attributes along the K groups of
sequences and identify characteristic local associations. Analyzing the factorial maps,
the user chooses which of the K x H identified blocks of sequences/attributes to mine
for AR. The AR mined in each block can be used, for example, to describe the buying
behavior of a niche of customers.

Fig. 1 Factorial map of attributes: whole data and within group representation

With respect to the strategy of row-wise quantification, Fig. 2 refers to
the visualization of sequences in the step by step procedure to emphasize the underlying
clustering structure. Note that just three of the eight steps are represented
due to the lack of available space. In particular the first window on the left represents
the initialization step, with the random assignment of sequences to clusters.
The remaining windows represent the sequences at an intermediate step (step 5)
and at the final step (step 8). Although the procedure is automatic, the step by
step displays highlight the increasing degree of separation among sequences
until the optimal subspace of approximation is achieved. Once the optimal
subspace of approximation has been determined, the quantification of attributes is
represented in the left side of Fig. 3. Again, different colors correspond to different
sets of attributes obtained via agglomerative clustering: the corresponding hierarchy is
represented as a dendrogram in the right side of Fig. 3.













Fig. 2 Factorial map of sequences iteration by iteration (iterations 1, 5 and 8)

Fig. 3 Dendrogram and factorial map representation of attributes

As in the column-wise approach, the procedure defines K x H blocks of
sequences/attributes which can be investigated separately as they represent highly
homogeneous patterns of association.

Both approaches are supported by extensive graphical output that eases the user's
investigation. User interaction should be possible in a data mining process, but it
should not be strictly necessary. In this sense, future work on the presented approach
is directed towards the optimal automatic choice of both
K and H. Furthermore, the number of defined blocks (K x H) can be large in real
world applications, thus an evaluation criterion to establish which blocks to analyze
at a deeper level is required.



6 Conclusion 

The study of association in binary tables makes the interpretation of results a
demanding task. Usually, in the association study of binary variables,
the analysis focuses on the inter-dependencies considering two,
three variables at a time, and so on; therefore the total number of relationships to
take into account quickly becomes large as the number of variables grows.
Hence the main problem to cope with is to control the maximum number of significant
associations to take into account. As already pointed out, the large number
of observations makes the classical inferential approach ineffective: any deviation
of the sample statistic from $H_0$ is significant. This is fully coherent with
general inference theory, yet definitely unprofitable in the present context.

In the Data Mining context, the maximum number of AR to take into account
is usually controlled by tuning the support and confidence thresholds. Tight
bounds limit the total number of AR but they may also lead to discarding interesting
associations; on the other hand, wide bounds do not help to solve the problem.

Another strategy consists in executing an ex-post analysis. Bruzzese and Davino 
(2000) proposed a graphical post analysis in the study of AR. 

The present paper shows two different approaches combining quantification and 
clustering of binary data, where an alternate steps procedure aims at optimising 
the same criterion. Therefore, the strategy assures better results under the (quite 
common) hypothesis that data consist of unknown homogeneous groups of units 
whose behaviours reflect their characteristics. 

The main benefit of the proposed strategy consists in preserving the possibility of
having both analytical results and graphical displays, in accordance with the
Multidimensional Data Analysis logic. At the same time, at the cost of a small loss of
information, the variable quantification offers the opportunity to deal with a reduced
number of continuous variables, which are, moreover, independent.



Kernel Methods for Detecting the Direction
of Time Series



Jonas Peters, Dominik Janzing, Arthur Gretton, and Bernhard Schölkopf



Abstract We propose two kernel based methods for detecting the time direction
in empirical time series. First we apply a Support Vector Machine on the
finite-dimensional distributions of the time series (classification method) by embedding
these distributions into a Reproducing Kernel Hilbert Space. For the ARMA
method we fit the observed data with an autoregressive moving average process and
test whether the regression residuals are statistically independent of the past values.
Whenever the dependence in one direction is significantly weaker than in the other
we infer the former to be the true one.

Both approaches were able to detect the direction of the true generating model 
for simulated data sets. We also applied our tests to a large number of real world 
time series. The ARMA method made a decision for a significant fraction of them, 
in which it was mostly correct, while the classification method did not perform as 
well, but still exceeded chance level. 

Keywords Classification • Support vector machines • Time series



1 Introduction 

Consider the following problem: We are given m ordered values $X_1, \ldots, X_m$ from
a time series, but we do not know their direction in time. Our task is to find out
whether $X_1, \ldots, X_m$ or $X_m, \ldots, X_1$ represents the true direction. The motivation to
study this unusual problem is two-fold:

(1) The question is a simple instantiation of the larger issue of what characterizes
the direction of time, which is an issue related to philosophy and physics
(Reichenbach, 1956), in particular to the second law of thermodynamics. One possible
formulation of the latter states that the entropy of a closed physical system can
only increase but never decrease (from a microphysical perspective the entropy is
actually constant in time and only increases after appropriate coarse-graining of the
physical state space; Balian, 1992). This may suggest the use of entropy criteria to
identify the time direction in empirical data. However, most real-life time series
(such as those given by data from stock markets, EEGs, or meteorology) do not stem
from closed physical systems, and there is no obvious way to use entropy for detecting
the time direction. Moreover, we also want to detect the direction of stationary
processes.

(2) Analyzing such asymmetries between past and future can provide new 
insights for causal inference. Since every cause precedes its effect (equivalently, 
the future cannot influence the past), we have, at least, partial knowledge of the 
ground truth (Eichler & Didelez, 2007). 

In this work we propose the classification method for solving this problem: Consider
a strictly stationary time series, that is a process for which the $w$-dimensional
distribution of $(X_{t+h}, X_{t+1+h}, \ldots, X_{t+w+h})$ does not depend on $h$ for any choice
of $w \in \mathbb{N}$. We assume that the difference between a forward and backward sample
ordering manifests in a difference in the finite-dimensional distributions of
$(X_t, X_{t+1}, \ldots, X_{t+w})$ and $(X_{t+w}, X_{t+w-1}, \ldots, X_t)$. For many time series we thus
represent both distributions in a Reproducing Kernel Hilbert Space and apply a
Support Vector Machine within this Hilbert Space.

Shimizu, Hoyer, Hyvärinen, and Kerminen (2006) applied their causal discovery
algorithm LiNGAM to this problem. Their approach was able to propose a hypothetical
time direction for 14 out of 22 time series (for the other cases their algorithm did
not give a consistent result); however, only 5 out of these 14 directions turned out
to be correct. Possible reasons for this poor performance will be discussed below.
Nevertheless, we now describe the idea of LiNGAM because our ARMA method
builds upon the same idea. Let $\epsilon$ be the residual after computing a least squares
linear regression of Y on X for real-valued random variables X, Y. If X and $\epsilon$ are
statistically independent (note that they are uncorrelated by construction) we say
that the joint distribution P(X, Y) admits a linear model from X to Y. Then the only
case admitting a linear model in both directions is that P(X, Y) is bivariate Gaussian
(except for the trivial case of independence). The rationale behind LiNGAM
is to consider the direction as causal that can better be fit by a linear model. This
idea also applies to causal inference with directed acyclic graphs (DAGs) having n
variables $X_1, \ldots, X_n$ that linearly influence each other.

There are three major problems when using conventional causal inference tools 
(Pearl, 2002; Spirtes, Glymour, & Scheines, 1993) in determining time series direc- 
tion. First, the standard framework refers to data matrices obtained by iid sampling 
from a joint distribution on n random variables. Second, for interesting classes of 
time series like MA and ARMA models (introduced in Sect. 2), the observed variables
$(X_t)$ are not causally sufficient since the (hidden) noise variables influence
more than one of the observed variables. Third, the finitely many observations 
are typically preceded by instances of the same time series, which have not been 
observed. 

Besides the classification method mentioned before we propose the following
ARMA approach: for both time directions, we fit the data with an ARMA model and
check whether the resulting residuals are indeed independent of the past observations.
Whenever the residuals are significantly less dependent for one direction than
for the converse one, we infer the true time direction to be the former. To this end,
we need a good dependence measure that is applicable to continuous data and finds
dependencies beyond second order. Noting that the ARMA method might work for
other dependence measures, too, we use the Hilbert-Schmidt Independence Criterion
(HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005). This recently developed
kernel method will be described together with the concept of Hilbert Space embeddings
and ARMA processes in Sect. 2. Section 3 explains the method we employ for
identifying the true time direction of time series data. In Sect. 4 we present results
of our methods on both simulated and real data.



2 Statistical Methods 

2.1 A Hilbert Space Embedding for Distributions 

Recall that for a positive definite kernel $k$ a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$
is called a Reproducing Kernel Hilbert Space (RKHS) if $k(x, \cdot) \in \mathcal{H}$ $\forall x \in \mathcal{X}$ and
$\langle f, k(x, \cdot) \rangle = f(x)$ $\forall f \in \mathcal{H}$. Here, $k(x, \cdot)$ denotes a function $\mathcal{X} \to \mathbb{R}$ with
$y \mapsto k(x, y)$. We can represent received data in this RKHS using the feature map
$\Phi(x) := k(x, \cdot)$.

We can further represent probability distributions in the RKHS (Smola,
Gretton, Song, & Schölkopf, 2007). To this end we define the mean elements
$\mu[P](\cdot) = E_{X \sim P} \, k(X, \cdot)$, which are vectors obtained by averaging all $k(X, \cdot)$ over
the probability distribution P. Gretton, Borgwardt, Rasch, Schölkopf, and Smola
(2007) introduced the Maximum Mean Discrepancy (MMD) between two
probability measures P and Q, which is defined in the following way: mapping
the two measures into an RKHS via $P \mapsto \mu[P]$, the MMD is the RKHS distance
$\|\mu[P] - \mu[Q]\|_{\mathcal{H}}$ between these two points. Assume the following conditions on the
kernel are satisfied: $k(x, y) = \psi(x - y)$ for some positive definite function $\psi$, and $\psi$
is bounded and continuous. Bochner’s theorem states that $\psi$ is the Fourier transform
of a nonnegative measure $\Lambda$. Assume further that $\Lambda$ has a density with respect to the
Lebesgue measure, which is strictly positive almost everywhere. It can be shown that
under these conditions on the kernel the embedding $\mu$ is injective (Sriperumbudur,
Gretton, Fukumizu, Lanckriet, & Schölkopf, 2008) and therefore the MMD is zero if
and only if P = Q. Note that the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$
on $\mathbb{R}^d$ satisfies all conditions mentioned above. In our experiments we chose $2\sigma^2$ to
be the median of all distances $\|x - y\|^2$, following Gretton et al. (2007).
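
To make the embedding concrete, the following sketch estimates the squared MMD between two finite samples by replacing each mean element with the average of the kernel functions of its sample points. It is a minimal illustration, not the authors' code: the function names are hypothetical, and the bandwidth rule is read here as the median of all pairwise squared distances in the pooled sample.

```python
from math import exp
from statistics import median


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(x, y))


def gaussian_kernel(x, y, two_sigma_sq):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return exp(-sq_dist(x, y) / two_sigma_sq)


def median_heuristic(points):
    """Median of all pairwise squared distances, used as 2 * sigma^2."""
    d = [sq_dist(points[i], points[j])
         for i in range(len(points)) for j in range(i + 1, len(points))]
    return median(d)


def mmd_squared(xs, ys, two_sigma_sq):
    """Biased estimate of ||mu[P] - mu[Q]||^2 from samples xs ~ P and ys ~ Q."""
    def mean_k(A, B):
        return sum(gaussian_kernel(a, b, two_sigma_sq)
                   for a in A for b in B) / (len(A) * len(B))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


xs = [(0.0,), (1.0,)]
ys = [(5.0,), (6.0,)]
bandwidth = median_heuristic(xs + ys)
value = mmd_squared(xs, ys, bandwidth)
```

The three averages are the inner products $\langle \mu[P], \mu[P] \rangle$, $\langle \mu[Q], \mu[Q] \rangle$ and $\langle \mu[P], \mu[Q] \rangle$ evaluated on the samples, so the estimate vanishes when both samples coincide.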

If only a finite sample $(X_1, \ldots, X_m)$ of a random variable is given we can estimate
the mean element by $\mu[\hat{P}^m] = \frac{1}{m} \sum_{i=1}^{m} k(X_i, \cdot)$. If the kernel is strictly
positive definite the mean elements of two samples are the same if and only if the samples are
of the same size and consist of exactly the same points. In this sense these Hilbert
space representations inherit all relevant statistical information of the finite sample.



2.2 Hilbert-Schmidt Independence Criterion

As mentioned earlier, the ARMA method requires an independence criterion that is
applicable to continuous data. The Hilbert-Schmidt Independence Criterion (HSIC)
is a kernel based statistic that, for sufficiently rich Reproducing Kernel Hilbert
Spaces (RKHSs), is zero only at independence. The name results from the fact that
it is the Hilbert-Schmidt norm of a cross-covariance operator (Gretton et al., 2005).
Following Smola et al. (2007), we will introduce HSIC in a slightly different way,
however, using the Hilbert Space Embedding from Sect. 2.1.

For the formal setup let X and Y be two random variables taking values on
$(\mathcal{X}, \Gamma)$ and $(\mathcal{Y}, \Lambda)$, respectively; here, $\mathcal{X}$ and $\mathcal{Y}$ are two separable metric spaces,
and $\Gamma$ and $\Lambda$ are Borel $\sigma$-algebras. We define positive definite kernels $k(\cdot, \cdot)$ and
$l(\cdot, \cdot)$ on the spaces $\mathcal{X}$ and $\mathcal{Y}$ and denote the corresponding RKHSs as $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$,
respectively. The product space $(\mathcal{X} \times \mathcal{Y}, \Gamma \otimes \Lambda)$ is again a measurable space and
we can define the product kernel $k(\cdot, \cdot) \cdot l(\cdot, \cdot)$ on it. X and Y are independent if and
only if $P^{(X,Y)} = P^X \otimes P^Y$. This means a dependence between X and Y is equivalent
to a difference between the distributions $P^{(X,Y)}$ and $P^X \otimes P^Y$.

The HSIC can be defined as the MMD between $P^{(X,Y)}$ and $P^X \otimes P^Y$. It can further
be shown (Gretton et al., 2005) that for a finite amount of data, a biased empirical
estimate of HSIC is given by a V-statistic,

$$\widehat{\mathrm{HSIC}} = m^{-2} \, \mathrm{tr}(HKHL),$$

where $H = I - \frac{1}{m} (1, \ldots, 1)^{T} (1, \ldots, 1)$, K and L are the Gram matrices of the
kernels $k$ and $l$ respectively, and $m$ is the number of data points. Under the assumption
of independence (where the true value of HSIC is zero), $m \cdot \widehat{\mathrm{HSIC}}$ converges
in distribution to a weighted sum of Chi-Squares, which can be approximated by a
Gamma distribution (Kankainen, 1995). Therefore we can construct a test under the
null hypothesis of independence.
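
The biased estimate $m^{-2} \, \mathrm{tr}(HKHL)$ can be sketched in a few lines. The code below is an illustration under stated assumptions: it handles real-valued samples only, uses Gaussian kernels with bandwidth parameters fixed to 1 by default (an arbitrary choice made for the example), and does not implement the Gamma approximation needed for a p-value. All function names are hypothetical.

```python
from math import exp


def gram_matrix(xs, two_sigma_sq):
    """Gaussian Gram matrix for a list of real numbers."""
    return [[exp(-(a - b) ** 2 / two_sigma_sq) for b in xs] for a in xs]


def center(K):
    """Return HKH with H = I - (1/m) 1 1^T (double centering)."""
    m = len(K)
    row = [sum(r) / m for r in K]
    col = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    total = sum(row) / m
    return [[K[i][j] - row[i] - col[j] + total for j in range(m)]
            for i in range(m)]


def hsic_biased(xs, ys, two_sigma_sq_x=1.0, two_sigma_sq_y=1.0):
    """Biased V-statistic estimate m^{-2} tr(HKHL)."""
    m = len(xs)
    Kc = center(gram_matrix(xs, two_sigma_sq_x))   # HKH
    L = gram_matrix(ys, two_sigma_sq_y)
    trace = sum(Kc[i][j] * L[j][i] for i in range(m) for j in range(m))
    return trace / m ** 2


xs = [0.1, 0.5, 0.9, 1.3]
dependent = hsic_biased(xs, xs)          # second sample identical to the first
constant = hsic_biased(xs, [2.0] * 4)    # second sample carries no information
```

When the second sample is constant, L is a matrix of ones and the trace reduces to the sum of the entries of the centered matrix HKH, which is zero; this is a quick sanity check of the centering step.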



2.3 Autoregressive Moving Average Models 

Recall that a time series $(X_t)_{t \in \mathbb{Z}}$ is a collection of random variables and is called
strictly stationary if $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1 + h}, \ldots, X_{t_n + h})$ are equal in distribution
for all $t_i, h \in \mathbb{Z}$. It is called weakly (or second-order) stationary if $X_t$ has finite
variance and

$$E X_t = \mu \quad \text{and} \quad \mathrm{cov}(X_t, X_{t+h}) = \gamma_h \quad \forall t, h \in \mathbb{Z},$$

i.e., both mean and covariance do not depend on the time t, and the latter depends
only on the time gap h.

Definition 1. A time series $(X_t)_{t \in \mathbb{Z}}$ is called an autoregressive moving average
process of order $(p, q)$, written ARMA$(p, q)$, if it is weakly stationary and if

$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t \quad \forall t \in \mathbb{Z},$$

where $\epsilon_t$ are iid and have mean zero. The process is called an MA process if $p = 0$
and an AR process if $q = 0$.

An ARMA process is called causal if every noise term $\epsilon_t$ is independent of all
$X_i$ for $i < t$.

We call a causal ARMA process time-reversible if there is an iid noise sequence
$\tilde{\epsilon}_t$ such that $X_t = \sum_{i=1}^{p} \tilde{\phi}_i X_{t+i} + \sum_{j=1}^{q} \tilde{\theta}_j \tilde{\epsilon}_{t+j} + \tilde{\epsilon}_t$, where $\tilde{\epsilon}_t$ is independent of
all $X_i$ with $i > t$.

In the theoretical works of Weiss (1975) and Breidt and Davis (1991), a strictly
stationary process is called time-reversible if $(X_0, \ldots, X_h)$ and $(X_0, \ldots, X_{-h})$
are equal in distribution for all $h$. However, this notion is not appropriate for our
purpose because, a priori, it could be that forward and backward processes are both
ARMA processes even though they do not coincide in distribution.
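
The recursion in Definition 1 can be simulated directly. The sketch below is an illustration only: `simulate_arma` and `power_noise` are hypothetical names, the burn-in length is an arbitrary choice, and the noise is drawn as $\mathrm{sgn}(Z) \cdot |Z|^r$ with standard normal $Z$ (the family used in Sect. 4) without the variance normalization. The coefficients are those reported for the simulated ARMA(2,2) series in Sect. 4.

```python
import math
import random


def power_noise(r, rng):
    """Draw sgn(Z) * |Z|**r with Z ~ N(0, 1); r = 1 gives Gaussian noise."""
    z = rng.gauss(0.0, 1.0)
    return math.copysign(abs(z) ** r, z)


def simulate_arma(phi, theta, n, noise, burn_in=200):
    """Simulate X_t = sum_i phi_i X_{t-i} + sum_j theta_j e_{t-j} + e_t."""
    p, q = len(phi), len(theta)
    x = [0.0] * max(p, 1)   # arbitrary start values, washed out by the burn-in
    e = [0.0] * max(q, 1)
    out = []
    for t in range(n + burn_in):
        e_t = noise()
        ar = sum(phi[i] * x[-1 - i] for i in range(p))      # phi_1 X_{t-1} + ...
        ma = sum(theta[j] * e[-1 - j] for j in range(q))    # theta_1 e_{t-1} + ...
        x_t = ar + ma + e_t
        x.append(x_t)
        e.append(e_t)
        if t >= burn_in:
            out.append(x_t)
    return out


rng = random.Random(0)
series = simulate_arma([0.9, -0.3], [-0.29, 0.5], 500,
                       lambda: power_noise(2.0, rng))
```

Reversing the list (`series[::-1]`) gives the backward ordering whose ARMA fit is compared with the forward one.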



3 Learning the True Time Direction 
3.1 The Classification Method 

In Sect. 2.1 we saw how to represent a sample distribution in an RKHS. Given a 
training and a test set we can perform a linear SVM on these representations in the 
RKHS. This linear classifier only depends on pairwise dot products, which we are 
able to compute: 



$$\left\langle \frac{1}{m} \sum_{i=1}^{m} k(x_i, \cdot), \ \frac{1}{n} \sum_{j=1}^{n} k(\tilde{x}_j, \cdot) \right\rangle = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, \tilde{x}_j). \qquad (1)$$

Note that this method allows us to perform an SVM on distributions. We note that 
this is only one possible kernel on the space of probability measures; see Hein and 
Bousquet (2005) for an overview. 

We now apply this idea to the finite-dimensional distributions of a time series. 
Each time series yields two points in the RKHS (correct and reversed direction), 
on which we perform the SVM. The classification method can be summarized as 
follows: 

• Choose a fixed window length $w$ and take for each time series many finite-dimensional
samples $X_{t_1} = (X_{t_1}, \ldots, X_{t_1 + w})$, $X_{t_2} = (X_{t_2}, \ldots, X_{t_2 + w})$, $\ldots$,
$X_{t_m} = (X_{t_m}, \ldots, X_{t_m + w})$. The $t_i$ can be chosen such that $t_{i+1} - (t_i + w) = \text{const}$, for
example. The larger the gap between two samples of the time series, the less
dependent these samples will be (ideally, we would like to have iid data, which
is, of course, impossible for structured time series). Represent the distribution of
$(X_t, \ldots, X_{t+w})$ in the RKHS using the point $\frac{1}{m} \sum_{i=1}^{m} k(X_{t_i}, \cdot)$.

• Perform a (linear) soft margin SVM on these points using (1) for computing the 
dot product. 
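
The window sampling in the first step can be sketched as follows. This is a minimal illustration with hypothetical function names: windows of length $w + 1$ are taken with a constant gap between the end of one window and the start of the next, once for the series as given and once for its time-reversed copy.

```python
def sample_windows(series, w, gap):
    """Windows (X_t, ..., X_{t+w}) with start points t_{i+1} = t_i + w + gap."""
    step = w + gap
    return [tuple(series[t:t + w + 1])
            for t in range(0, len(series) - w, step)]


def forward_and_backward(series, w, gap):
    """Windows of the series in the given order and in reversed time order."""
    return sample_windows(series, w, gap), sample_windows(series[::-1], w, gap)


toy = list(range(10))
fwd, bwd = forward_and_backward(toy, w=2, gap=2)
```

Each of the two window sets is then mapped to one point in the RKHS by averaging the kernel functions of its windows, and these two points are the inputs of the SVM.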



3.2 The ARMA Method 

We state without proof the following theorem (Peters, 2008): 

Theorem 1. Assume that $(X_t)$ is a causal ARMA process with iid noise and
non-vanishing AR part. Then the process is time-reversible if and only if the process
is Gaussian distributed.

If a time series is an ARMA process with non-Gaussian noise, the reversed time 
series is not an ARMA process anymore. Theorem 1 justifies the following proce- 
dure to predict the true time direction of a given time series: 

• Assume the data come from a causal ARMA process with non- vanishing AR part 
and independent, non-Gaussian noise. 

• Fit an ARMA process in both directions to the data (see, e.g., Brockwell & Davis,
1991) and consider the residuals $\epsilon_t$ and $\tilde{\epsilon}_t$, respectively.

• Using HSIC and a significance level $\alpha$, test if $\epsilon_t$ depends on $X_{t-1}, X_{t-2}, \ldots$ or
if $\tilde{\epsilon}_t$ depends on $X_{t+1}, X_{t+2}, \ldots$ and call the p-values of both tests $p_1$ and $p_2$,
respectively. According to Theorem 1 only one dependence should be found.
If the hypothesis of independence is indeed rejected for only one direction (i.e.,
exactly one p-value, say $p_1$, is smaller than $\alpha$) and additionally $p_2 - p_1 > \delta$,
then propose the direction of $p_2$ to be the correct one.

• If the noise seems to be Gaussian (e.g., according to the Jarque-Bera test; Jarque &
Bera, 1987) do not decide.

• If both directions lead to dependent noise, conclude that the model assumption is 
not satisfied and do not decide. 
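
The decision rule in the list above can be written down compactly. The following Python sketch only covers the final decision step and takes the two HSIC p-values as given; the function name, the labels "forward" and "backward", and the default values of `alpha` and `delta` are illustrative assumptions rather than part of the original method description.

```python
def propose_direction(p_forward, p_backward, alpha=0.05, delta=0.05,
                      noise_looks_gaussian=False):
    """Return "forward", "backward", or None (no decision).

    p_forward:  p-value of the independence test for the residuals of the
                model fitted in the given time order.
    p_backward: p-value for the model fitted to the reversed series.
    alpha, delta: illustrative defaults, not values from the paper.
    """
    if noise_looks_gaussian:
        return None  # Gaussian case: the process is time-reversible
    rejected_fwd = p_forward < alpha
    rejected_bwd = p_backward < alpha
    if rejected_fwd and rejected_bwd:
        return None  # both directions dependent: model assumption violated
    if not rejected_fwd and not rejected_bwd:
        return None  # no dependence found in either direction
    if abs(p_forward - p_backward) <= delta:
        return None  # p-values too close to each other
    # Propose the direction whose independence hypothesis was not rejected.
    return "backward" if rejected_fwd else "forward"
```

Smaller `alpha` and larger `delta` make the function return `None` more often, which mirrors the trade-off between the number of decisions and their accuracy.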

For the method described above, two parameters need to be chosen: the significance
level $\alpha$ and the minimal difference in p-values $\delta$. When $\alpha$ is smaller and
$\delta$ is larger, the method makes fewer decisions, but these should be more accurate.
Note that for the independence test we need iid data, which we cannot assume here. 
The time series values may have the same distribution (if the time series is strictly 
stationary), but two adjacent values are certainly not independent. We reduce this 
problem, however, by not considering every point in time, but leaving gaps of a few 
time steps in between. 

Note that the ARMA method only works for ARMA processes with non-Gaussian 
noise. Gaussianity is often used in applications because of its nice computational 
properties, but there remains controversy as to how often this is consistent with the 
data. In many cases using noise with heavier tails than the Gaussian would be more
appropriate (e.g., Mandelbrot, 1967). 



4 Experiments 

Simulated Data 

We show that the methods work for simulated ARMA processes provided that the 
noise is not Gaussian distributed. We simulated data from an ARMA(2,2) time series 
with fixed parameters ($\phi_1 = 0.9$, $\phi_2 = -0.3$, $\theta_1 = -0.29$ and $\theta_2 = 0.5$) and using
varying kinds of noise. The coefficients are chosen such that they result in a causal 
process (see Brockwell & Davis, 1991 for details). For different values of r we 
sampled 

$\epsilon_t \sim \operatorname{sgn}(Z) \cdot |Z|^r$,

where $Z \sim \mathcal{N}(0, 1)$, and normalized in order to obtain the same variance for all r.
Only r = 1 corresponds to a normal distribution. Theorem 1 states that the reversed
process is again an ARMA(2,2) process only for r = 1, which results in the same
finite-dimensional distributions as the correct direction. Thus we expect both meth- 
ods to fail in the Gaussian case. However, we are dealing with a finite amount of data 
and if r is close to 1, the noise cannot be distinguished from a Gaussian distribution 
and we will still be able to fit a backward model reasonably well. 
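
For illustration, the simulation setup can be sketched as follows. The sketch assumes the standard ARMA sign convention of Brockwell & Davis, $X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}$, normalizes the noise by its empirical standard deviation, and discards a burn-in period; the function names, the burn-in length and the seed handling are assumptions made for this example.

```python
import math
import random


def sample_noise(n, r, rng):
    """Draw eps ~ sgn(Z) * |Z|**r with Z ~ N(0, 1), rescaled to unit sample variance."""
    raw = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        raw.append(math.copysign(abs(z) ** r, z))
    mean = sum(raw) / n
    var = sum((e - mean) ** 2 for e in raw) / n
    return [e / math.sqrt(var) for e in raw]


def simulate_arma22(n, r, phi=(0.9, -0.3), theta=(-0.29, 0.5),
                    burn_in=200, seed=0):
    """Simulate n points of the ARMA(2,2) process with the coefficients from the text."""
    rng = random.Random(seed)
    eps = sample_noise(n + burn_in, r, rng)
    x = [0.0, 0.0]  # arbitrary start values, forgotten after the burn-in
    for t in range(2, n + burn_in):
        x.append(phi[0] * x[t - 1] + phi[1] * x[t - 2]
                 + eps[t] + theta[0] * eps[t - 1] + theta[1] * eps[t - 2])
    return x[burn_in:]
```

Calling `simulate_arma22(500, r=1.0)` gives the Gaussian (time-reversible) case, while values of r further away from 1 give increasingly non-Gaussian noise.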

The classification method performed well on the simulated data (see Fig. 1). 
Notice, however, that using the same coefficients in all simulations makes the prob- 
lem for the SVM considerably easier. When we used different parameters for each 
simulated time series the classification method performed much worse (at chance 
level), while the ARMA method could still detect the correct direction in most cases. 




Fig. 1 Classification method on the ARMA processes. For each value of r (i.e., for each kind of 
noise) we simulated 100 instances of an ARMA(2,2) process with fixed coefficients and divided 
them into 85 time series for training and 15 for testing; this was done 100 times for each r. The 
graph shows the average classification error on the test set and the corresponding standard deviation 



Fig. 2 For each value of r (expressing the non-Gaussianity of the noise) we simulated 100 
instances of an ARMA(2,2) process with 500 time points and show the acceptance ratio for the
forward model (solid line) and for the backward model (dashed line). When the noise is significantly
different from Gaussian noise (r = 1), the correct direction can be identified



For the ARMA method we fit an ARMA model to the data without making use of 
the fact that we already know the order of the process; instead we used the Akaike 
Information Criterion which penalizes a high order of the model. If we detected a 
dependence between residuals and past values of the time series, we rejected this 
direction, otherwise we accepted it. Obviously, for the true direction we expect that 
independence will only be rejected in very few cases (depending on the significance 
level). For the independence test we used a significance level of $\alpha = 0.01$. See Fig. 2
for details. 



Real World Data 

In order to see if the methods are applicable to real data as well, we collected data 
consisting of 200 time series with varying length (from 100 up to 10,000 sam- 
ples) from very different areas: finance, physics, transportation, crime, production 
of goods, demography, economy, EEG data and agriculture. Roughly two thirds of 
our data sets belonged to the groups economy and finance. Fig. 3 shows the results
for the classification method. 

Since the performance of the ARMA method strongly depends on the chosen 
parameters, we give its results for different values. The classification accuracy consistently
exceeds 50%, and for more conservative parameter choices, a larger proportion of
the time series for which a decision is made are correctly classified. See Fig. 4 for details.



5 Conclusion and Discussion 

We have proposed two methods to detect the time direction of time series. One 
method is based on a Support Vector Machine, applied to the finite-dimensional 
distributions of the time series. The other method tests the validity of an ARMA 
model for both directions by testing whether the residuals from the regression were 
statistically independent of the past observations. 



Fig. 3 Classification method on the time series collection. Five hundred times we chose randomly 
a test set of size 20, trained the method on the remaining 180 time series and looked at the perfor- 
mance on the test set. For the SVM regularization parameter we chose C = 10, which resulted 
in a training error of 29.8% ± 1.8% and a test error of 35.7% ± 10.5%. We reached the same
performance, however, for values of C which were several orders of magnitude lower or higher 
Fig. 4 ARMA method on the time series collection. We cut the longer time series into smaller
pieces of length 400 and obtained 576 time series instead of 200. We show the results for different
values of the parameters: the minimal difference in p-values $\delta$ ranges between 0% and 20%, and
the significance level $\alpha$ between 10% and 0.1%. The point with the best classification performance
corresponds to the highest value of $\delta$ and the lowest value of $\alpha$



Experiments with simulated data sets have shown that we were able to identify 
the true direction in most cases unless the ARMA processes were Gaussian dis- 
tributed (and thus time-reversible). For a collection of real world time series we 
found that in many cases the data did not admit an ARMA model in either direction, 
or the distributions were close to Gaussian. For a considerable fraction, however, the 
residuals were significantly less dependent for one direction than for the other. For 
these cases, the ARMA method mostly recovered the true direction. The classifica- 
tion method performed on average worse than the ARMA method, but still exceeded 
chance level. 

Classification accuracies were not on par with the classification problems 
commonly considered in machine learning, but we believe that this is owed to the 
difficulty of the task; indeed we consider our results rather encouraging. 
It is interesting to investigate whether there are more subtle asymmetries between
past and future in time series that cannot be classified by our approach (i.e., if there 
is a simple generative model in the forward but not the backward direction in a 
more general sense). Results of this type would shed further light on the statistical 
asymmetries between cause and effect. 
Statistical Processes Under Change:
Enhancing Data Quality with Pretests 



Walter Radermacher and Sabine Sattelberger 



Abstract Statistical offices in Europe, in particular the Federal Statistical Office 
in Germany, are meeting users’ ever more demanding requirements with innovative 
and appropriate responses, such as the multiple sources mixed-mode design model. 
This combines various objectives: reducing survey costs and the burden on intervie- 
wees, and maximising data quality. The same improvements are also being sought 
by way of the systematic use of pretests to optimise survey documents. This paper 
provides a first impression of the many procedures available. An ideal pretest com- 
bines both quantitative and qualitative test methods. Quantitative test procedures 
can be used to determine how often particular input errors arise. The questionnaire 
is tested in the field in the corresponding survey mode. Qualitative test procedures 
can find the reasons for input errors. Potential interviewees are included in the questionnaire
tests, and their feedback on the survey documentation is systematically
analysed and used to upgrade the questionnaire. This was illustrated in our paper 
by an example from business statistics (“Umstellung auf die Wirtschaftszweigklas- 
sifikation 2008” - Change-over to the 2008 economic sector classification). This 
pretest not only gave important clues about how to improve the contents, but also 
helped to realistically estimate the organisational cost of the main survey. 

Keywords Data quality · Pretests



1 Introduction 

Production of high quality statistics is the main task of official statistics. Tech- 
nological progress, globalisation, the increasing significance and diversification 
of information and its distribution are only some general terms for the changes 
we are faced with today. Needless to say, those changes strongly affect the work 
of statistical services and pose challenges that can only be met with innovative 
and appropriate methods, for instance a model of “multiple sources mixed-mode
design”. The point of this model is to maximise data quality and minimise the cost 
and burden for the survey participants. 

Furthermore, even with the implementation of this elaborated model of dif- 
ferent data sources and different survey modes, traditional types of surveys will 
still be needed, at least in the short term. That is why this article focuses on an 
additional method for improving data quality and moreover contributing to the user- 
friendliness of surveys - namely the use of pretests during, or ideally before, the 
actual data production process. Pretests fulfil a number of functions: they minimise
non-sampling errors, they reduce the burden for the respondents of comprehensive 
questionnaires and they test the feasibility of a concept in practice. By combin- 
ing quantitative and qualitative methods for pretesting, significant increases in data 
quality can be achieved. 



2 The Model of Multiple Sources Mixed-Mode Design 

Over recent years, the field of official statistics has been subjected to a root-
and-branch modernisation in order to meet the growing needs of users (e.g., the
increasing need for high-quality data, rapidly produced and free-of-charge, where 
possible) despite cuts in resources. In this context, consideration has been given 
to the model of multiple sources mixed-mode design, which is currently gaining 
ground in business statistics. Two fundamental principles are associated with this 
model: 

1. Multiple sources: traditionally, official statistics in the field of business statistics
used mainly primary data. In other words, a primary survey was carried out using 
a questionnaire developed by the statistical service. For cost reasons alone, this 
approach is being increasingly called into question. The information needed for 
statistics is being collected - where possible and sensible - from various existing 
sources (administrative data or public records), not designed primarily for sta- 
tistical purposes. Moreover, methods of estimation are being increasingly used. 
Technological advances in data processing facilitate the use of these sources. 
Primary and secondary data are, where possible, being combined, in order to fill
gaps in the data. In the area of business statistics, the transition to a records-based 
system has already begun, the basis of which is the development of a business 
register. Eventually, this will replace or suspend some of the traditional censuses. 
Where primary surveys do take place, often not all the observation units but only 
those exceeding certain cut-off limits and thus having a central importance to the 
overall result of the statistics will be surveyed. Consequently, fewer units are sur- 
veyed, the data from which must therefore meet even higher standards of validity 
and reliability. 

2. Mixed mode: Official statistical offices are also moving towards less expen- 
sive modes of capturing data, so that, for example, paper-and-pencil surveys are 
replaced by online procedures, automatic data collection from company accounts 
or data yielded from public records. In this way, the statistical offices can enter,
process, evaluate, and prepare data much more efficiently, also taking advantage 
of progress in the automation of data processing. The survey process is also being 
made much easier for respondents, which also improves data quality. 

The purpose of multiple sources mixed-mode design is to use primary surveys only 
if the results yielded from administrative data do not meet the statistical require- 
ments and any anomalies cannot be removed even through sufficiently high-quality 
estimates based on additional information. If primary surveys are performed, they 
should be as easy and efficient as possible. The model of multiple sources mixed- 
mode design has the potential to achieve clear cost reductions and meets the ever 
more demanding requirements of users, as described above. But not only that, the 
needs of those providing the information (especially in the case of obligatory sur- 
veys) are taken into account, with the excessive burden of direct and sometimes 
cumbersome questionnaires being considerably reduced, which can enhance data 
quality. 

However, it is hardly likely that, in future, it will be possible to completely 
eliminate the “traditional” method of data collection - surveys. Where a survey is 
necessary to collect the desired information, optimising the questionnaire can help 
to make life significantly easier for respondents, reduce survey costs and enhance 
data quality. “Good” questionnaires are vital for a survey to succeed. For this rea- 
son, questionnaires developed by the official statistical offices are being increasingly 
evaluated using pretests and, if necessary, revised so as to ensure that they are 
“good” (i.e., comprehensible and easy to use). 



3 Qualitative and Quantitative Test Methods 
for Questionnaires 

The statistical offices in Germany currently prepare more than 170 federal statis- 
tics based on questionnaire data. Any weaknesses due to deficient questionnaires 
are very difficult or expensive to rectify at a later stage (e.g., by carrying out 
additional plausibility tests or contacting respondents for further clarification). To
further enhance their data, statistical offices look not only for sampling errors but 
also measuring errors which can be attributed to the survey method, the answering 
behaviour of interviewees or the interviewer. These non-sample-related errors can 
be investigated using pretests. 

Suitable test methods can be used to test questionnaires at various stages in the 
development and processing phases. For example, the opinion of expert users or 
expert statisticians can be sought. Potential interviewees were rarely called upon in 
the past to assess questionnaires - if at all, then by the interviewers, who indirectly 
passed on the problems of “their” respondents. In future, respondents should be 
consulted more often, using suitable methods. 
Many testing methods are available to evaluate questionnaires. Depending on the
intention of the test, they can be used individually or in combination at the various
stages of the development of the questionnaire.¹ Different tests are therefore used for
new questionnaires being developed than for existing data collection instruments, or
where data are already available from previous surveys. Apart from procedures that
support the development of questionnaires and form the basis for a well-thought-out
questionnaire, a distinction is made between two main categories:
qualitative and quantitative test methods. Qualitative methods are often used in the
(pretest) laboratory,² whilst quantitative ones are used in the field, i.e., in situ with
interviewees and under field conditions. One of the main differences between
these two approaches is that the qualitative test procedure uses methods from qualitative
and cognitive psychology and the number of test persons is limited, as a rule,
to between 15 and 30. The quantitative test methods, on the other hand, involve many subjects
(as a rule, more than 100); this is a good way to find out about the frequency
of problems, and only a few qualitative elements are addressed. The sections below
provide an overview of the main qualitative and quantitative test methods.



3.1 Qualitative Test Methods 

The three most commonly used qualitative test methods are cognitive interviews 
with individuals and various questioning techniques (mostly in the laboratory), expert
discussion groups and observing test persons as they complete questionnaires.³



3.1.1 Cognitive Interviews 

The most important qualitative test method is definitely the cognitive interview, 
whereby potential interviewees are involved, with the aim of clarifying how inter- 
viewees answer and which problems they may have. In other words, the question 
and answer process is repeated, so as to show up problems with certain questions 
attributable, for example, to differences in interpretation, opaque questioning or lack 
of information. In this way, the reasons for erroneous entries become evident. It is 
rare for the entire questionnaire to be tested using qualitative methods - usually, just 
certain areas are chosen. 



¹ This overview focuses mainly on the handbook published by Eurostat to support the statistical
offices of the European Community, which includes concrete recommendations on questionnaire
development and testing; see Eurostat (2006a, pp. 85-119).

² Qualitative methods not performed in the laboratory include company site visits, where cognitive
interviews are conducted at the premises of the businesses themselves.

³ Focus groups are usually described in the literature as a qualitative test method, although they
are often used in questionnaire development as early as the development stage. As this overview
focuses on the methods generally conducted in the pretest laboratory, focus groups are not
discussed in detail. For an introduction, see Krueger and Casey (2002).
Cognitive interviews are used mainly where experts have developed a questionnaire
reflecting the research and data needs of official statistics and which could be
used in the field. In order to check whether the questions and answers are in line with
the original aims of the questionnaire, test persons are invited to interviews in order
to give feedback on the questionnaire, with the aim of finding information about
how understandable the questions are, about the answering process, and potential
difficulties with the questionnaire and their causes.⁴

“Cognitive” in this context means that the individual thought processes during 
answering (understanding the question; recalling the information; deciding how to 
answer; giving the answer) are examined individually.⁵ The aim is to test the development
of the questionnaire, i.e., the presentation of question and answer categories,
the design and the sequence of questions. The respondent should take an active role 
in the process. 

Cognitive interviews are very time-intensive and in-depth, both to prepare and to 
implement. They should last no longer than one hour. As a rule, small samples of 
around 15 test persons are invited to the laboratory and recorded audiovisually as
they are interviewed individually. In certain cases, laboratory interviews are unsuit- 
able, for example company surveys, which are generally conducted on site, i.e., at 
the company premises. This has the benefit that interviewees have the necessary 
reference documents to hand. 

Cognitive interviews are conducted in a structured and (semi-)standardised way 
using an interview guide, so as to ensure comparability between subjects. The 
interviews focus on certain important and/or difficult questions or terms. Vari- 
ous individual techniques can be applied, depending on the requirements of the 
survey or the interest of the research. These individual techniques are outlined 
below. 

Questioning techniques in cognitive interviews: 

• Probes 

• Think aloud 

• Paraphrasing 

• Confidence rating 

• Sorting terms and situations 

• Response latency 



Probes 

In German-speaking countries, probes are the most common technique, as they are 
very good at finding out how interviewees interpret certain terms or how difficulties 



⁴ Concerning the techniques and how cognitive interviews are conducted, see, in particular
Willis (2005). For a briefer overview see also Prüfer and Rexrodt (2005).

⁵ See, for example, Tourangeau, Rips and Rasinski (2000) and Biemer and Fecso (1995).

in answering questions arise. The technique is thus always used when potential
problems are suspected with a particular question. For example, interviewees are asked
how they interpret particular terms (comprehension probing) or how they selected 
an answer from the given alternatives (category selection probing). To formulate the 
probes it is vital to already have an idea of where the cognitive difficulties of cer- 
tain questions (could) lie. Probes can be categorised in various ways. They can be 
placed in different positions in temporal terms - either “follow-up” (directly after 
the question is answered) or retrospectively (after the whole questionnaire has been 
answered). They can also be broken down by their objective - whether the issue is 
how the question was answered or how the test person managed to recall or look up 
the information to give a particular reply. In company surveys, it is very interesting 
to have company documents described or see them at first hand. 



Think Aloud Technique 

With the think aloud technique, the test person is asked to verbalise (i.e., say out 
loud) all the considerations that go into answering a question, either directly as the 
question is being answered (parallel) or after the question or the entire questionnaire 
has been answered (retrospective). This shows up any problems of understanding, 
misinterpretations, information recall strategies and reactions to sensitive questions. 
The success of the technique is very much dependent on the extent to which the test 
person is willing or able to articulate their thoughts. 



Paraphrasing 

Paraphrasing means that interviewees recast a question in their own words, thus giv- 
ing an idea of whether they understood the question and interpreted it as intended. 
Alternative, possibly simpler or clearer wording can then be found for questions, 
for example if they always recast the question using the same words. There are two 
ways of paraphrasing: either the test person tries to recall the question word-for- 
word, or tries to describe the question in his or her own words. It is usually easier 
to judge whether test persons have really understood a question if they summarise 
it in their own words. This technique is particularly suitable for identifying com- 
plex and/or confusing questions, if, for example, important details such as reference 
timeframes are forgotten. However, paraphrasing is not equally useful for all test 
persons. Practice shows that the technique is very effective but rather off-putting 
for interviewees, so it should not be overused. It delivers clues as to where wrong 
interpretations may occur. It is recommended that, after the paraphrasing, probing 
techniques be used in order to clarify understanding of the terms in a more targeted 
way.⁶



⁶ See Prüfer and Rexrodt (2005, p. 13).



Confidence Rating

In confidence rating, test persons are asked, using a scale or freely, to estimate how 
reliable their answer is. The aim of these subjective estimates is to find out which 
questions are difficult to answer. Low reliability values can be attributed mainly 
to ignorance about the subject in question or difficulties in recalling the required 
information. 



Sorting of Terms or Situations in Categories 

The aim of sorting is to gain information about the categorisation of certain terms 
or the understanding of basic concepts. It is an additional way of testing under- 
standing. The test persons receive a certain number of situations or terms written 
on cards, which they then have to sort on the basis of their own or given criteria. 
It is possible to find out, for example, which paid activities test persons consider as 
paid employment, or to gain feedback about how subgroups of interviewees think 
differently (e.g., do schoolchildren with a paper round feel “employed” to the same 
extent as adults who deliver newspapers to make a living?). Analysing the resulting 
subjective classifications gives indications about how important it will be to further 
explain certain terms in the questionnaire and what terms are best understood by 
interviewees.⁷



Response Latency 

Response latency is the period of time between question and answer. The method is 
used to identify difficulties in interpreting questions, recalling information or select- 
ing categories of answer. The required time can be measured particularly well by 
all types of computer-based recording. It has to be borne in mind that individuals 
take different lengths of time to answer questions and that such differences do not 
necessarily have anything to do with the question. 
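
One way to take such individual differences into account is to standardise each test person's latencies against that person's own average before comparing questions. The following Python sketch illustrates this idea; the within-person standardisation, the data layout and the function name are assumptions made for illustration and are not prescribed by the method as described above.

```python
from statistics import mean, pstdev


def standardised_latencies(latencies):
    """Average within-person z-score of the response latency per question.

    latencies: {person: {question: seconds}} (hypothetical data layout).
    A high value marks a question that is slow relative to each
    person's own typical answering speed.
    """
    per_question = {}
    for person, times in latencies.items():
        values = list(times.values())
        m = mean(values)
        s = pstdev(values)
        for question, seconds in times.items():
            z = (seconds - m) / s if s > 0 else 0.0
            per_question.setdefault(question, []).append(z)
    return {q: mean(zs) for q, zs in per_question.items()}
```

With this adjustment, a generally slow respondent and a generally fast one contribute equally to the picture of which questions take unusually long.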



3.1.2 Expert Discussion Groups 

Discussion groups with experts are used to test whether, for example, the right 
questions corresponding to the data required are being asked. Expert questionnaire 
designers examine the questionnaire in accordance with the criteria of their dis- 
cipline, in order to find out whether it will really be able to elicit the required 
information and whether the content and the design pose any problems to inter- 
viewers or interviewees. These groups can be used early in the questionnaire 



⁷ See Martin (2004) and Prüfer and Rexrodt (2005).

development phase or to examine already existing questionnaires. The discussion
can either be structured (using a standardised coding scheme) or unstructured, with
the first variant being recommended. However, these expert groups do not include
the views of interviewees, so they should be combined with other methods.⁸



3.1.3 Observation 

Observations are conducted to identify difficulties with the wording, sequence and 
layout of questions. They also provide a reliable estimate of how long it takes to 
fill out a questionnaire. Test persons complete a questionnaire, preferably in the lab- 
oratory, while being recorded. Their behaviour is observed (what they say, facial 
expressions, gestures) so that any difficulties with the questionnaire can be identi- 
fied (e.g., repeatedly reading the questions aloud, frowning, flicking back and forth 
through the questionnaire). Follow-up probes are often used to elicit information 
about the reasons for particular behaviours. 



3.2 Quantitative Test Methods 

In the evaluation of questionnaires under field conditions, methods are used which 
replicate real life as closely as possible, in terms of the questioning situation, length, 
selection and sequence of questions. Field methods can be applied either in a specific 
field test, in a pilot study (to also observe other survey processes) or in parallel with 
an ongoing survey during the actual data collection. They include a larger number 
of units and allow quantitative analyses. They focus on the entire questionnaire, not 
certain parts. Test methods such as behaviour coding, interviewer and interviewee 
debriefing, follow-up interviews and experiments are used. 



3.2.1 Behaviour Coding 

Behaviour coding is the standardised, systematic encoding of interaction between 
the interviewer and interviewee, in order to assess the quality of the questionnaire 
(and the instructions for the interviewer). In most cases, this method is implemented 
in the field, but it is possible to implement it under laboratory conditions too. Using 
an encoding scheme,⁹ the reactions of interviewers and interviewees are classified
and an overall impression is gained of how the question and answer process is work- 
ing. For example, one area examined is whether the interviewer asks the question 
exactly or with minor or major changes. Interviewees’ reactions are also encoded, 
e.g., interrupting the question to give the answer, asking for more information or 



⁸ See also the minimum standards of the US Census Bureau (2003).

⁹ For an exemplary encoding scheme, see Eurostat (2006b, pp. 108-109).

answering inadequately. Behaviour coding thus tries to document qualitative aspects
in the question and answer process, which can then be used in the field. As the 
encoding is often performed in parallel, by an additional observer or by retrospec- 
tively listening to a recording, it is very labour intensive, although its quantitative 
approach can help to provide information about frequently encountered problems. 
The method does not try to assess the reasons for particular behaviours, as cognitive 
interviews would do. 
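
The quantitative side of behaviour coding amounts to counting how often problematic codes occur for each question. The Python sketch below illustrates such a tally; the code labels are hypothetical stand-ins loosely based on the behaviours mentioned above and do not reproduce any official encoding scheme.

```python
from collections import Counter

# Hypothetical code labels; a real study would use its own encoding scheme.
PROBLEM_CODES = {
    "major_change",           # interviewer changes the question substantially
    "interruption",           # interviewee interrupts the question
    "clarification_request",  # interviewee asks for more information
    "inadequate_answer",      # interviewee answers inadequately
}


def problem_rates(coded_events):
    """Share of problematic codes per question.

    coded_events: iterable of (question_id, code) pairs.
    """
    totals = Counter()
    problems = Counter()
    for question, code in coded_events:
        totals[question] += 1
        if code in PROBLEM_CODES:
            problems[question] += 1
    return {q: problems[q] / totals[q] for q in totals}
```

Questions with a high share of problematic codes are candidates for closer inspection, for example in cognitive interviews, since the tally itself says nothing about the reasons.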



3.2.2 Interviewer and Interviewee Debriefing 

Interviewer debriefing is a structured form of group discussion after the field work, 
whereby interviewers tell the experts who designed the questionnaire about their 
experiences of using it. Possible variants are telephone conferences or questions on 
paper. The aim is to gain feedback about the effectiveness of the questionnaire, so as 
to identify any weak points which could pose problems for interviewers. Indirectly, 
limited information about interviewees’ difficulties with the questionnaire can also 
be found. The technique is suitable for new or ongoing surveys. 

In addition, special interviewee debriefings can be held, to find out how inter- 
viewees experienced a particular questionnaire and the sources they used to provide 
their answers. Interviewee debriefings are usually conducted using structured indi- 
vidual follow-up interviews, group interviews or with the aid of a self-completion 
feedback questionnaire, either directly after the questions are asked or within a few 
days. Hypotheses about potential problems must already exist, developed, for exam- 
ple, on the basis of previous quantitative tests, such as behaviour coding or item 
non-response. 



3.2.3 Follow-up Interviews 

Follow-up interviews are carried out shortly after the real interview by another inter- 
viewer. The aim is to allow interviewees to say how they interpreted certain terms 
or questions, how they arrived at their answers, the extent to which the answer given 
actually reflects their own opinion and how important it is for them in general to
give answers that are as accurate as possible. These semi-structured interviews are
very time and cost intensive and should therefore be limited to a few selected
questions which can be handled in detail.



3.2.4 Experiments 

“Split ballot” procedures are used to identify and minimise undesirable measuring
effects in advance. Different versions of questions and/or questionnaires (e.g.,
household or individual questionnaires with different designs) or different methods
of data collection (e.g., paper and online questionnaires) are tested under field
conditions and on a sufficiently large population. Important aspects in determining
the “best” option include data quality, the distribution of answers and cost.
Experiments can be conducted before a new survey or be incorporated in ongoing
ones. The test persons are divided into subgroups, each of which is presented
a particular variant of the questionnaire or is questioned in a particular way. The
size of the sample for each subgroup should be sufficient for statistically reliable
results for the total population to be drawn. It is also vital for the interviewees
to be allocated to a subgroup randomly, so that differences in answers can definitely
be attributed to the type of survey or method of questioning, rather than other
influencing factors.

See, for example, Tourangeau et al. (2000) and Biemer and Fecso (1995, pp. 57-281).
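The random allocation described above can be sketched as follows. This is only a minimal illustration: the function name `assign_variants`, the variant labels and the fixed seed are assumptions made for the example and do not come from the survey practice described here.

```python
import random
from collections import Counter

def assign_variants(respondent_ids, variants, seed=0):
    """Randomly allocate respondents to questionnaire variants in near-equal groups."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    shuffled = list(respondent_ids)
    rng.shuffle(shuffled)
    # Dealing the shuffled list out round-robin keeps group sizes within one of each other.
    return {rid: variants[k % len(variants)] for k, rid in enumerate(shuffled)}

# Hypothetical split ballot with two data collection methods and 100 test persons.
assignment = assign_variants(range(1, 101), ["paper", "online"])
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Because the allocation depends only on the shuffle, any systematic difference between the subgroups' answers can be attributed to the variant rather than to how the groups were formed.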



3.2.5 Post-evaluation Methods 

Post-evaluation methods serve to assess the quality of the questionnaire and are used 
after the actual collection of data. They can be used to analyse data sets of ongoing 
surveys. Aspects such as non-responses to particular questions, the distribution of 
answers per question, data preparation or imputation rates are usually investigated 
on the basis of existing data. 

Item non-response is analysed by examining the questions with the highest
non-response rates for possible causes. In addition to examining the sensitive
nature of various questions and the length of the questionnaire, the detailed
investigation includes further variables in order to find out whether particular
response patterns correlate with certain interviewee characteristics (e.g., level of
education).
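A minimal sketch of such an item non-response analysis is given below. The record layout (one dictionary per interviewee, with `None` marking a missing answer), the question names and the helper names are assumptions made for illustration only.

```python
from collections import defaultdict

def item_nonresponse_rates(records, questions):
    """Share of missing (None) answers per question, highest rate first."""
    rates = {}
    for q in questions:
        missing = sum(1 for r in records if r.get(q) is None)
        rates[q] = missing / len(records)
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

def nonresponse_by_group(records, question, group_key):
    """Non-response rate for one question, broken down by an interviewee characteristic."""
    totals, missing = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        if r.get(question) is None:
            missing[r[group_key]] += 1
    return {g: missing[g] / totals[g] for g in totals}

# Hypothetical data set with two questions and one background variable.
records = [
    {"education": "low", "q1": "yes", "q2": None},
    {"education": "low", "q1": "no", "q2": None},
    {"education": "high", "q1": None, "q2": "5"},
    {"education": "high", "q1": "yes", "q2": "3"},
]
ranking = item_nonresponse_rates(records, ["q1", "q2"])
by_education = nonresponse_by_group(records, "q2", "education")
```

The ranking points to the questions that deserve a closer look, and the breakdown shows whether the missing answers are concentrated in a particular group of interviewees.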

It is vital to consider the distribution of answers per question if, in a prior exper- 
iment, various variants of the survey instrument or method were tested. This can 
determine whether the different variants lead to different answer patterns. Gener- 
ally, an analysis of this kind is conducted in conjunction with qualitative methods 
subsequent to the evaluation of the data sets. The reasons for any measuring errors 
discovered can be more easily explained by way of a qualitative approach than an 
exclusively quantitative one. 

High imputation rates also suggest that certain characteristics cannot be identified 
via the questionnaire without further measures. 

The test methods for questionnaires in Sect. 3 were categorised according to their 
methodology and the number of test persons and/or interviewees. Depending on the 
stage of development of the questionnaire, qualitative methods tend to be used at 
the start, and quantitative methods later on. Eurostat’s Handbook of Recommended 
Practices recommends combining quantitative and qualitative methods, as their
different perspectives complement each other well. The Federal Statistical Office found
that a qualitative pretest can generally be carried out in 3 months, whilst a
quantitative test takes at least 6 months. Existing data sets can also be evaluated. They
conceal a multitude of revealing findings which, because of the priority given to
content-based analyses, are often neglected for reasons of time.

See, for example, Fowler (2004).

We describe below, using a practical example, how a pretest should ideally be 
conducted in official statistics and which lessons can be drawn from the results, 
on the basis of a pretest concerning the change-over to the 2008 classification of 
economic activities, carried out in 2007 in the Federal Statistical Office. The com- 
bination of quantitative and qualitative test methods, with very different objectives, 
allows for a comprehensive evaluation of the questionnaire and therefore makes this 
case particularly interesting. 



4 The Pretest Concerning the Change-over to the 2008 
Economic Sector Classification 



As part of the harmonisation and updating of the classification of economic activi- 
ties at European level (NACE Rev. 2), a revised national classification of economic 
activities (WZ 2008) was to be introduced in Germany in 2008. The allocation
to a particular sector is an important feature in the register of companies. For
most economic statistics, it forms the basis for identifying total populations and
groups of respondents (including the development of a sampling plan and the
implementation of sampling), and it therefore had to be adapted at much the same
time as the new classification was introduced. In the run-up to the update, it became
clear that many German companies could be converted automatically from the old
(WZ 2003) to the new (WZ 2008) classification without any major difficulties, for
example where a particular code in WZ 2003 corresponds to exactly one code in
WZ 2008. In such cases, a primary survey would not be needed. However, it was found
that about 700,000 companies had to be contacted in order to clarify which sector of
activity they belong to under the new classification system. This was done by a
questionnaire in two waves.

Consequently, a written self-completion questionnaire was developed for 2008 
and had to be tested before use. Moreover, a number of organisational procedures in 
the design of the survey had to be reviewed beforehand. The aim of the pretest was 
to estimate the cost, in terms of organisation and resources, of the main survey, to 
test the quality of entries and develop a coherent appearance for the final version of 
the paper questionnaire. In order to meet these objectives, a two-stage pretest was
carried out:



Further pretests were dealt with in more detail in other Statistisches Bundesamt publications and
are mentioned here for reference: Blanke, Gauckler, and Sattelberger (2008), Blanke, Gauckler,
and Wein (2007), Gauckler, Körner, and Minkel (2007).

Information on this subject is provided in German by the relevant department of the
Statistisches Bundesamt at
http://www.destatis.de/jetspeed/portal/cms/Sites/destatis/Internet/DE/Navigation/Klassifikationen/UmstellungWZ2008.

Stage 1: Quantitative Pretest (Written Self-completion Pilot Questionnaire)

Around 500 companies were contacted and asked to complete the pilot questionnaire
and return it to the Federal Statistical Office. One hundred and thirty-three
companies returned a completed questionnaire. The questionnaire was then evalu- 
ated and the data were entered into an Access database. The organisational proce- 
dure could thus be tested in a test run, and an idea was also gained about the nature 
and frequency of certain difficulties with the questionnaire. Evident problems and 
units with particular classification problems were noted. 



Stage 2: Qualitative Pretest (Cognitive Interviews by Telephone) 

On the basis of the completed questionnaires, 17 companies with particular clas- 
sification problems were contacted by telephone, via the contact person given on 
the form, so that certain questions for clarification could be asked. A “script” was 
used, structuring and standardising the cognitive interview, so as to guarantee
comparability between the subjects. 



Results 

The quantitative pretest showed that some 10% of the companies written to were not 
reached by post on the first attempt, and so neutral sample attrition rates in the same 
order are to be expected also for the main survey. Extrapolated to the main survey, 
this means that around 70,000 units would, if necessary, have to be followed up. An 
approximate amount of work could be estimated for this. 

Around half (N = 66) of the participating companies (N = 133) assigned
themselves unambiguously to one of the economic sectors listed and can therefore
be recorded fully automatically. The other half (N = 67) could not classify
themselves unambiguously. In roughly 70% of these cases, the main activity was
summarised in the companies' own words in the free-text field. These questionnaires
would then have to be allocated manually to a particular economic sector. The quality
of the free-text entries varied widely. Overall, follow-up investigations would be
unavoidable in an estimated 17% of cases in the main survey.
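The figures quoted in these results can be retraced with a few lines of arithmetic. The sketch below only uses numbers stated in the text; the variable names are illustrative, and the 17% follow-up estimate is not recomputed because its derivation is not given.

```python
MAIN_SURVEY_UNITS = 700_000   # companies to be contacted in the main survey
UNDELIVERABLE_SHARE = 0.10    # share not reached by post on the first attempt

RETURNED = 133                # completed pilot questionnaires
UNAMBIGUOUS = 66              # companies that classified themselves unambiguously
FREE_TEXT_SHARE = 0.70        # rounded share of the remaining cases using free text

# Extrapolating the pretest attrition rate to the main survey.
expected_undeliverable = round(MAIN_SURVEY_UNITS * UNDELIVERABLE_SHARE)

# Split of the returned questionnaires.
ambiguous = RETURNED - UNAMBIGUOUS
unambiguous_share = UNAMBIGUOUS / RETURNED
free_text_cases = round(FREE_TEXT_SHARE * ambiguous)

print(expected_undeliverable, ambiguous, round(unambiguous_share, 3), free_text_cases)
```

The extrapolation assumes that the main survey behaves like the pilot, which is the same assumption the text makes when it expects attrition "in the same order".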

The qualitative pretest (N = 17) looked at why certain companies had problems
with the classification. Three main reasons were mentioned: one group of companies
found the listed sectors outdated, a second group took the view that none of the
sectors listed applied to them, whilst a third group thought that the listed sectors
were not detailed enough.



These were businesses picked at random from the company register, classified by economic
sector, for which an unambiguous, automatic allocation would not have been possible.

For information on how to run cognitive interviews, see in particular Willis (2005). For a briefer
overview, see also Prüfer and Rexrodt (2005).

As a result of the pretest, the latter group will find instructions in the main survey
as to how to tackle the classification problem. For the other two groups, it is clear 
that updating the classification of economic activities by way of a survey will also 
lead to an updating of information in the register of companies, which will certainly 
be of interest and benefit to both users and the companies themselves. 

The pretest also provided feedback about a snappy and attractive title for the 
survey. The existing title (“Abfrage zum Wirtschaftszweig” - Query concerning the 
Economic Sector) was very abstract and also implied that this information is not 
available, which irritated companies. Accordingly, the main survey will be entitled 
“Aktualisierung der wirtschaftlichen Haupttätigkeit” (Updating the classification
of main economic activity) which is more appealing to companies and makes the
intention clear, i.e., to present updated information.



Abstract Several learning tasks comprise hierarchies. Comparison with a “gold-standard”
is often performed to evaluate the quality of a learned hierarchy. We
assembled various similarity metrics that have been proposed in different disciplines 
and compared them in a unified interdisciplinary framework for hierarchical evalua- 
tion which is based on the distinction of three fundamental dimensions. Identifying 
deficiencies for measuring structural similarity, we suggest three new measures for 
this purpose, either extending existing ones or based on new ideas. Experiments with 
an artificial dataset were performed to compare the different measures. As shown by 
our results, the measures vary greatly in their properties. 

Keywords Clustering · Gold-standard evaluation · Hierarchy · Ontology learning



1 Introduction 

An important task of organizing information is to induce a hierarchical structure 
among a set of information items or to assign information resources to a predefined 
hierarchy. In order to assess the quality of a learned hierarchical scheme, often an 
external “gold-standard” is invoked for comparison. For this task, various similarity 
metrics have been proposed, mostly depending on the characteristics of the applied 
learning procedure. This work aims at bringing together the different disciplines 
by presenting and comparing existing gold-standard based evaluation methods for 
learning algorithms that generate hierarchies. Our goal is explicitly not to identify 
the “best” method or metric, but to inform the choice of an appropriate measure in a
given context. In the following, we start by giving a definition of a hierarchy
which abstracts from the considered learning tasks. In Sect. 2, we review existing
evaluation strategies from the literature. Based thereon, we present an interdisciplinary
framework for evaluation in Sect. 3, emphasizing the strong similarities of
evaluation in different disciplines. The section is completed with a set of new measures
that address deficiencies in the currently available methods. The paper is
concluded with several experiments in Sect. 4.

As mentioned before, different learning tasks work on different kinds of hierar- 
chies. To make the commonalities of these structures explicit and therefore enable 
comparison, we define a hierarchy as follows (also compare Maedche, 2002): 

Definition 1. A hierarchy is a triple $\mathcal{H} = (N, \prec, \mathit{root})$ whereby $N$ is a non-empty
set of nodes and $\prec\ \subseteq N \times N$ is a strict partial ordering defined on the set of
nodes $N$. $n_1 \prec n_2$ implies that the node $n_2$ is a direct parent of $n_1$ in the hierarchy.
Furthermore, $<_\mathcal{H}$ denotes the ancestor relation:
$n_1 <_\mathcal{H} n_2 \Leftrightarrow \exists n_{i_1}, \ldots, n_{i_r} : n_1 = n_{i_1} \prec \cdots \prec n_{i_r} = n_2$.
$\mathit{root} \in N$ is the root node of the hierarchy
($\forall n \in N \setminus \{\mathit{root}\} : n <_\mathcal{H} \mathit{root}$).

This definition describes hierarchies as directed acyclic graphs (DAG) or poly- 
hierarchies with exactly one root node. A tree or mono-hierarchy is a special DAG, 
in which every node (except the root node) has exactly one parent:
$\forall n \in N, n \neq \mathit{root} : |\{n' \in N \mid n \prec n'\}| = 1$. As finding appropriate labels for these nodes is a
challenging subtask on its own (e.g., cluster labeling or concept naming), we will 
assume that a lexicon is assigned to a hierarchy: 
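To make Definition 1 concrete, the following sketch stores the direct-parent relation as a dictionary and derives the ancestor relation from it. The class name `Hierarchy`, its methods and the example node names are assumptions made for this illustration, not part of the original formalization.

```python
class Hierarchy:
    """A hierarchy with one root; parents[n] is the set of direct parents of node n."""

    def __init__(self, root, parents):
        self.root = root
        self.parents = {n: set(ps) for n, ps in parents.items()}
        self.nodes = {root} | set(self.parents)
        for ps in self.parents.values():
            self.nodes |= ps

    def ancestors(self, node):
        """All nodes reachable by following direct-parent links (the ancestor relation)."""
        seen, stack = set(), list(self.parents.get(node, ()))
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(self.parents.get(p, ()))
        return seen

    def is_valid(self):
        """Root has no parent, every other node reaches the root, and there is no cycle."""
        if self.parents.get(self.root):
            return False
        for n in self.nodes - {self.root}:
            anc = self.ancestors(n)
            if self.root not in anc or n in anc:
                return False
        return True

    def is_tree(self):
        """A mono-hierarchy: every non-root node has exactly one direct parent."""
        others = self.nodes - {self.root}
        return self.is_valid() and all(len(self.parents.get(n, ())) == 1 for n in others)

# Hypothetical poly-hierarchy: node "c" has two direct parents.
dag = Hierarchy("root", {"a": {"root"}, "b": {"root"}, "c": {"a", "b"}})
tree = Hierarchy("root", {"a": {"root"}, "c": {"a"}})
```

Here `dag` is a valid hierarchy but not a tree, whereas `tree` satisfies the additional one-parent condition.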

Definition 2. A lexicon assigned to a hierarchy $\mathcal{H}$ is a pair $\mathcal{L}_\mathcal{H} = (L, F)$ whereby
$L$ is a set of lexical entries intended to describe nodes and $F \subseteq L \times N \times (0, 1]$ is
the labeling relation ($(l, n, d) \in F$ means that $l$ is a label of $n$ with $d$ being the
descriptiveness of $l$ for $n$).

The descriptiveness of a label can represent, e.g., the confidence of a learning 
algorithm having found several possible labels for a node. If an algorithm always 
assigns a single lexical entry to a node, we call this a strict lexicon. Furthermore, 
learning tasks often require assigning data instances to the learned hierarchy. This
instance assignment is separately defined as follows: 

Definition 3. An instance assignment to a hierarchy $\mathcal{H}$ is a pair $\mathcal{I}_\mathcal{H} = (I, A)$
whereby $I$ is a set of instances and $A \subseteq I \times N \times (0, 1]$ is the assignment relation
($(i, n, a) \in A$ means that the item $i$ is assigned to node $n$ with the association
strength $a$).

Methods can either allow an instance to be assigned to more than one node (e.g.,
multi-label classification, soft clustering) or only to a single node (e.g., single-label
classification, hard clustering). We will refer to the latter as strict instance
assignment.
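The labeling and assignment relations of Definitions 2 and 3 can be represented as sets of triples, which makes the strictness conditions easy to check. The sketch below is illustrative only; the function names and the example labels, nodes and documents are assumptions.

```python
from collections import defaultdict

def weights_in_range(relation):
    """Every triple must carry a weight in the half-open interval (0, 1]."""
    return all(0 < w <= 1 for _, _, w in relation)

def is_strict_lexicon(labeling):
    """Strict lexicon: each labeled node has exactly one lexical entry."""
    labels = defaultdict(set)
    for label, node, _ in labeling:
        labels[node].add(label)
    return all(len(ls) == 1 for ls in labels.values())

def is_strict_assignment(assignment):
    """Strict instance assignment: each instance is assigned to exactly one node."""
    nodes = defaultdict(set)
    for instance, node, _ in assignment:
        nodes[instance].add(node)
    return all(len(ns) == 1 for ns in nodes.values())

# Hypothetical labeling relation F with (label, node, descriptiveness) triples
# and assignment relation A with (instance, node, association strength) triples.
F = {("sports", "n1", 1.0), ("football", "n2", 0.8), ("soccer", "n2", 0.6)}
A = {("doc1", "n1", 1.0), ("doc2", "n2", 1.0)}
```

In this example the lexicon is not strict, because node `n2` carries two labels, while the instance assignment is strict.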

Several learning tasks can include hierarchies. Here, we consider hierarchical 
classification, hierarchical clustering, and ontology learning. In hierarchical classi- 
fication, the classes in which items are classified are hierarchically related. This 
means the nodes in the hierarchy correspond to classes. The hierarchy $\mathcal{H}$ itself
is given in advance and, therefore, is fixed. Unknown is the assignment of new
instances to the hierarchy. To stick to the previous definition, $\mathcal{I}_\mathcal{H}$ is to be learned
and evaluated.

Evaluation Strategies for Learning Algorithms of Hierarchies

In hierarchical clustering, a hierarchy of clusters is extracted from a dataset. The
hierarchy itself is therefore learned by the algorithm at the same time as items are
assigned to it. Both aspects (i.e., $\mathcal{H}$ and $\mathcal{I}_\mathcal{H}$) need to be evaluated. Furthermore,
semi-supervised learning is a hybrid of these two distinct tasks, where part of the
hierarchy is known in advance but can be further extended (Bade & Nürnberger,
2008). Cluster labeling aims at making an extracted cluster structure more useful
by naming clusters. This task builds a lexicon $\mathcal{L}_\mathcal{H}$ in addition to a learned structure
which is subject to evaluation.

Ontology learning from text is concerned with learning tasks on different lev- 
els like term extraction, concept extraction or relation learning (Cimiano, 2006). 
As concept hierarchies (defined by the taxonomic relation) are a core component 
of ontologies, they are targeted by many ontology learning methods. Here, it
is important to evaluate the learned hierarchy $\mathcal{H}$ as well as the concept labels
$\mathcal{L}_\mathcal{H}$. Depending on the chosen method, the learned concept hierarchy is populated
with instances. In such a case, this instance assignment $\mathcal{I}_\mathcal{H}$ is also subject to
evaluation.



2 Evaluation Strategies in the Literature 

A very common approach for evaluating learning algorithms is to use a dataset as 
gold-standard. The learned output is compared to this gold-standard whereby a per- 
fect match is best. To be able to do this, the dataset must include the desired learning 
result, i.e., a hierarchy. The major drawback of this approach is that it usually cannot take into account
that there is more than one correct solution. Nevertheless, it is often used and also 
the focus of this work. Other evaluation strategies include the definition of qual- 
ity metrics on the result alone. These are intrinsic values like number of nodes in 
the hierarchy or intra-cluster similarity. Furthermore, a learning result could also be 
evaluated by human assessment. However, this method is very time-consuming and 
furthermore affected by the evaluators’ subjectivity. In the following, we present an 
overview of gold-standard based evaluation approaches from the literature. Please
note that they are all closely focused on a specific problem. To the best of our
knowledge, no previous work exists that tries to unify evaluation on hierarchies over
several tasks.

Several works have been published on hierarchical classification. However, only some
of these consider the hierarchy in their evaluation. An easy procedure is to apply
standard measures from flat classification on different subsets of classes. For example,
in Ceci and Malerba (2003), precision and recall were computed on different hierarchy
levels and the error rate was split to distinguish between specialization and
generalization error. Although these methods provide an idea of the behavior of
an algorithm, it is often difficult to determine the better of two algorithms. This
is avoided by giving a single measure. Sun and Lim (2001) extended precision
and recall to allow for different severity of a misclassification. They proposed
two different measures, one based on category cosine similarity and one based on
category hierarchy distance. Cai and Hofmann (2004) defined a taxonomy-based
loss motivated by a document filtering setting. They associated different costs with
a false classification depending on the structural relation between predicted and
true class. These and other approaches to evaluate hierarchical classification usually
differ mainly in their strategies of defining a “partially correct” classification.
In Bade, Hüllermeier, and Nürnberger (2006), we generalized the standard measures
precision, recall, and accuracy to integrate any strategy through the use of a utility
function. As an example, this work used a strategy oriented towards user
interaction with the hierarchy.

Comparing the structure itself often requires a mapping between the node sets 
as a first step. In the literature, this mapping is performed based on either the 
assigned lexicon or the assigned instances. In flat clustering, clusters can be mapped 
to gold-standard classes, e.g., by maximizing the resulting instance-based evaluation
measure. For hierarchical clustering, we proposed such a mapping in Bade
and Nürnberger (2008) and then applied measures from classifier evaluation. In the
context of ontology learning, an example of node mapping based on concept labels
(i.e., the associated lexicon) is Dellschaft and Staab (2006). Brank, Mladenić, and
Grobelnik (2006) propose an implicit mapping based on instance assignment.

For cluster labeling, gold-standard evaluation is rarely used. Treeratpituk and 
Callan (2006) measured exact and partial label matches using precision in the top n 
results and mean reciprocal rank. They took into account identical terms as well as 
synonyms defined in the WordNet ontology, which should be preferred over human
assessment of matches. In Bade and Hermkes (2008), we define some measures that
were inspired by evaluation in ontology learning. Both approaches compare labels
on a node-to-node basis without integrating the hierarchy in their evaluation.

A core aspect of evaluating an ontology learning procedure is to compare con- 
cept hierarchies to a gold-standard. Dellschaft and Staab (2006) review existing 
measures. They postulate a multi-dimensional approach which strictly separates 
the comparison on a lexical and structural level. For both levels, they adapt tra- 
ditional precision and recall. While the lexical measures merely compare directly 
the label sets L of two hierarchies, taxonomic precision and recall integrate the 
structure by extracting characteristic excerpts (consisting of other nodes) for each 
node. These excerpts are then compared using standard precision and recall, and 
are finally aggregated to a global measure. The same paradigm underlies the taxonomic
overlap measure proposed by Maedche (2002). Neither approach considers
instance assignment. Brank et al. (2006) complement this by proposing
the OntoRand-Index, which compares two hierarchies based on how they structure
a set of common instances.

On a more abstract level, several algorithms exist that measure similarity of trees 
by the tree edit distance. Bille (2005) provides a good review of existing methods. 
For the general case, computing the tree edit distance is computationally expensive. 
However, for specific cases an efficient computation might be available. Even more 
general are methods dealing with graph similarity (see Schenker, Bunke, Last, & 
Kandel, 2005 for a survey).



3 Interdisciplinary Comparison of Evaluation Measures

Though originating from different disciplines, the presented evaluation strategies 
exhibit strong similarities. In line with the tripartite representation from Sect. 1, we
divided evaluation into three dimensions: structural, lexical, and instance assignment.
The structural dimension is concerned with comparing the node sets and
their hierarchical arrangement. A natural comparison is to require equality of both
structures and penalize each deviation in a symmetric way. However, sometimes an
asymmetric evaluation is more appropriate where one hierarchy is allowed to be a
refinement of the other, e.g., to evaluate whether a fine-grained dendrogram extracted
through hierarchical clustering reflects a more coarse-grained gold standard (Bade
& Nürnberger, 2008). On the lexical dimension, two hierarchies are similar if their
nodes are described using similar labels. Measures can be distinguished in terms of 
whether they require strict lexicons. The third dimension compares how two hierar- 
chies structure a set of instances. Measures can support strict or non-strict instance 
assignment. 

Theoretically, the given dimensions are independent. Consider for instance a 
hierarchical web directory whose category labels are translated into another lan- 
guage: While being equivalent on the structural and instance assignment level, 
the lexical similarity would decline dramatically. However, as mentioned earlier, 
some learning tasks touch several dimensions. E.g., in clustering, the structure is 
induced by a hierarchical arrangement of instances. According to Dellschaft and 
Staab (2006), it is desirable to evaluate each dimension separately without interfer- 
ence between dimensions. However, if the task already connects several dimensions, 
a measure influenced by the same dimensions can be appropriate. Apart from this, 
they postulate a proportional error effect to represent correctly the degree of severity
of an error. For example, a missing node close to the hierarchy’s root is typically judged
more severe than a missing leaf node. For this purpose, the output of the measure
should use the complete available interval (often [0, 1]). Despite these desirable
properties, the meaning of the measure should not be forgotten. Two measures 
might focus on the same dimension, while considering different view points on the 
problem. 

Distinguishing existing measures in terms of our dimensions, however, gives a 
good starting point. This was done in the upper part of Table 1 for the measures 
from the literature review. The second half contains measures that we propose here 
to complement the given selection. As can be seen, lexical agreement and instance 
assignment can be measured independently of the other dimensions. However, the 
measures often require a mapping of the two node sets as described earlier. Struc- 
tural similarity, on the other hand, can only be measured depending on either the 
lexicon or the instance assignment. This is necessary to identify corresponding loca- 
tions in both hierarchies. It depends on the task at hand to decide which procedure 
is more appropriate. As can be seen, most existing work depends on the lexicon 
while measures based on instance assignment are rare. The Onto-Rand is the only 
measure in this direction, which has some restrictions on its applicability. As
evaluation based on instance assignment is a very interesting problem (especially for
hierarchical clustering where no lexicon is built), we propose three alternative
measures. Please note that all these measures assume that the compared hierarchies are
defined on the same set of instances. We denote the gold-standard with $\mathcal{H}_g$, $\mathcal{L}_g$, and
$\mathcal{I}_g$ and the learned result with $\mathcal{H}_l$, $\mathcal{L}_l$, and $\mathcal{I}_l$.

Table 1 Dimensionality of different hierarchical evaluation measures (0 judged target dimension;
O influencing dimension; 0 not applicable; - not relevant)

Dimension                                             Lexical                 Structural              Inst. assignment
Measure                                               Strict      Non-strict  Sym.        Asym.       Strict      Non-strict
Sun and Lim (2001): modified prec./recall             -           -           -           -           0           0
Cai and Hofmann (2004): taxonomy-based loss           -           -           -           -           0           0
Bade et al. (2006): utility based prec./recall/acc.   -           -           -           -           0           0
Treeratpituk and Callan (2006): label matches         0           0           -           -           -           -
Bade and Hermkes (2008): tb/rb label similarity       0           0           -           -           -           -
Dellschaft and Staab (2006): lexical prec./recall     0           0           -           -           -           -
Dellschaft and Staab (2006): taxonomic prec./recall   O           O           0           0           -           -
Maedche (2002): taxonomic overlap                     O           O           0           0           -           -
Brank et al. (2006): Onto-Rand                        -           -           0           0           O           0
Bille (2005): tree edit distance                      O           O           0           0           -           -
H (1)                                                 -           -           0           0           O           0
ITP/ITR (4)                                           -           -           0           0           O           O
Extended OntoRand (5)                                 -           -           0           0           O           O

First, we propose H-Correlation. Similar to the Rand-Index (Rand, 1971) for flat
clustering, structure can be evaluated based on item assignment without requiring a
node mapping. This is also true for the Onto-Rand. However, our measure differs
in its support for asymmetric comparison as well as in the fundamental idea, which
compares instance triples as opposed to pairs. The sets of all instance triples
$\tau = (i_1, i_2, i_3)$ for which $i_1$ and $i_2$ have a more specific common ancestor $ca$ than $i_1$ and
$i_3$ (i.e., $ca(i_1, i_2) <_\mathcal{H} ca(i_1, i_3)$) are compared, with $T_l$ denoting this set for the learned
hierarchy and $T_g$ for the gold-standard. In its simplest form, H-Correlation
can measure the overlap of the two triple sets. This can be extended by weighting
individual triples differently. The symmetric correlation $H_s$ is defined on the left in
(1). Furthermore, an asymmetric correlation $H_a$ ((1) right) can be of interest, if one
hierarchy is allowed to be more detailed than the other. In (1), $\mathcal{H}_l$ is allowed to be
more detailed, whereby $\mathcal{H}_l$ is more detailed than $\mathcal{H}_g$ if
$N_l \supseteq N_g \wedge (\forall n_i, n_j \in N_g : n_i <_g n_j \rightarrow n_i <_l n_j) \wedge \mathit{root}_l \geq_l \mathit{root}_g$.

$$H_s = \frac{\sum_{\tau \in T_l \cap T_g} \left( w_l(\tau) + w_g(\tau) \right)}{\sum_{\tau \in T_l} w_l(\tau) + \sum_{\tau \in T_g} w_g(\tau)} \qquad H_a = \frac{\sum_{\tau \in T_l \cap T_g} w_g(\tau)}{\sum_{\tau \in T_g} w_g(\tau)} \qquad (1)$$

Here, we propose and compare the following two different triple weights, whereby
$w_1$ gives equal weight to each triple and $w_2$ gives equal weight to each hierarchy
node through node-specific normalization:

$$w_1(\tau) = 1 \qquad w_2(\tau) = 1 / |\{\tau' \mid ca(i_1, i_3) = ca(i_1', i_3')\}| \qquad (2)$$
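The following sketch shows one way the H-Correlation could be computed for small examples. It is an illustration under explicit assumptions, not the authors' implementation: hierarchies are trees given as child-to-parent dictionaries, instance assignment is strict (a dictionary from instance to node), triples consist of three distinct instances, the weight scheme names `"w1"` and `"w2"` are ad hoc, and an empty denominator is mapped to 0.0.

```python
from collections import Counter
from itertools import permutations

def path_to_root(parent, node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def common_ancestor(parent, a, b):
    """Deepest node that is an ancestor-or-self of both a and b (tree assumed)."""
    on_path = set(path_to_root(parent, a))
    return next(n for n in path_to_root(parent, b) if n in on_path)

def triples(parent, node_of):
    """Map each triple with ca(i1, i2) strictly below ca(i1, i3) to the node ca(i1, i3)."""
    depth = lambda n: len(path_to_root(parent, n))
    result = {}
    for i1, i2, i3 in permutations(node_of, 3):
        ca12 = common_ancestor(parent, node_of[i1], node_of[i2])
        ca13 = common_ancestor(parent, node_of[i1], node_of[i3])
        if depth(ca12) > depth(ca13):  # both lie on i1's path to the root
            result[(i1, i2, i3)] = ca13
    return result

def weights(triple_map, scheme):
    if scheme == "w1":                      # equal weight per triple
        return {t: 1.0 for t in triple_map}
    per_node = Counter(triple_map.values())  # w2: equal weight per node ca(i1, i3)
    return {t: 1.0 / per_node[ca] for t, ca in triple_map.items()}

def h_correlation(learned, gold, scheme="w1"):
    """Return (H_s, H_a); learned and gold are (parent, node_of) pairs."""
    t_l, t_g = triples(*learned), triples(*gold)
    w_l, w_g = weights(t_l, scheme), weights(t_g, scheme)
    common = t_l.keys() & t_g.keys()
    den_s = sum(w_l.values()) + sum(w_g.values())
    den_a = sum(w_g.values())
    h_s = sum(w_l[t] + w_g[t] for t in common) / den_s if den_s else 0.0
    h_a = sum(w_g[t] for t in common) / den_a if den_a else 0.0
    return h_s, h_a

# Hypothetical example: the learned tree refines gold node "A" into "A1" and "A2".
gold = ({"A": "root", "B": "root"}, {"x": "A", "y": "A", "v": "A", "z": "B"})
learned = ({"A": "root", "B": "root", "A1": "A", "A2": "A"},
           {"x": "A1", "y": "A1", "v": "A2", "z": "B"})
h_s, h_a = h_correlation(learned, gold)
```

In this example the asymmetric value is 1, because every gold triple is preserved by the refinement, while the symmetric value is lower, because the learned tree introduces two additional triples.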

The second measure is an adapted version of taxonomic precision and recall
(Dellschaft & Staab, 2006). Its original version requires a node mapping between
both hierarchies, which is done by matching node labels. This mapping is replaced
by an examination of instance assignment in our instance-based taxonomic precision
ITP and recall ITR, following the assumption that the instances assigned
to the nodes in a hierarchy do exhibit a sufficient degree of topic specificity. For
each instance $i$, we extract a characteristic excerpt from each hierarchy, namely the
instance-based semantic cotopy
$isc(i, \mathcal{H}) = \{ j \in I \mid (j, n, s) \in A \wedge (i, m, t) \in A \wedge (n <_\mathcal{H} m \vee m \leq_\mathcal{H} n) \}$.
This excerpt contains all instances assigned to the same,
sub- or super-nodes of the nodes $i$ is assigned to. Two excerpts of an instance are
compared as in (3), whereby $itp$ measures local precision and $itr$ local recall. The
global precision and recall values then combine the individual local values as shown
in (4) for the precision.

$$itp(i, \mathcal{H}_l, \mathcal{H}_g) = \frac{|isc(i, \mathcal{H}_l) \cap isc(i, \mathcal{H}_g)|}{|isc(i, \mathcal{H}_l)|} \qquad itr(i, \mathcal{H}_l, \mathcal{H}_g) = \frac{|isc(i, \mathcal{H}_l) \cap isc(i, \mathcal{H}_g)|}{|isc(i, \mathcal{H}_g)|} \qquad (3)$$

$$ITP(\mathcal{H}_l, \mathcal{H}_g) = \frac{1}{|I|} \sum_{i \in I} itp(i, \mathcal{H}_l, \mathcal{H}_g) \qquad (4)$$
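A small sketch of the instance-based taxonomic precision and recall follows. It is a hedged illustration rather than the authors' code: the hierarchy is given as a dictionary from node to its direct parents, the assignment as a set of (instance, node) pairs with association strengths omitted, the global recall is assumed to be averaged in the same way as the global precision, and all names are hypothetical.

```python
def ancestors(parents, node):
    """All nodes reachable from node via direct-parent links."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen

def isc(i, parents, assignment):
    """Instance-based semantic cotopy: instances on the same, sub- or super-nodes of i's nodes."""
    own = {n for (j, n) in assignment if j == i}
    related = set()
    for j, n in assignment:
        if any(n == m or m in ancestors(parents, n) or n in ancestors(parents, m)
               for m in own):
            related.add(j)
    return related

def itp_itr(learned, gold, instances):
    """Return (ITP, ITR) as averages of the local precision and recall values."""
    precision = recall = 0.0
    for i in instances:
        c_l, c_g = isc(i, *learned), isc(i, *gold)
        overlap = len(c_l & c_g)
        precision += overlap / len(c_l)  # i itself is always in its own cotopy
        recall += overlap / len(c_g)
    return precision / len(instances), recall / len(instances)

# Hypothetical example: the learned hierarchy splits gold node "A" into "A1" and "A2".
gold = ({"A": ["root"], "B": ["root"]},
        {("x", "A"), ("y", "A"), ("v", "A"), ("z", "B")})
learned = ({"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]},
           {("x", "A1"), ("y", "A1"), ("v", "A2"), ("z", "B")})
ITP, ITR = itp_itr(learned, gold, ["x", "y", "v", "z"])
```

Splitting a gold node keeps precision at 1, since every learned cotopy is contained in the gold one, but lowers recall, since instances that belong together in the gold standard are separated.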

The third measure extends the OntoRand Index to non-strict instance assignment.
In its original version, the basic idea is to represent a hierarchy as a vector. The
similarity of two hierarchies is then the similarity between the two vectors. Every
vector dimension corresponds to a pair of instances $i_1, i_2$. The entry at the respective
dimension contains a distance measure $\delta(n_1, n_2)$ between the nodes $n_1$ and $n_2$ to
which $i_1$ and $i_2$ are assigned. This requires a strict instance assignment. To extend
the measure to the non-strict case, the $\delta$-function is redefined for node sets instead
of single nodes as follows:

$$\Delta(N_1, N_2) = \sum_{n_1 \in N_1,\, n_2 \in N_2} \varpi_{1,2}\, \delta(n_1, n_2) \qquad (5)$$

The weighting factor $\varpi_{1,2}$ can be used to weight certain node combinations differently,
e.g., by their instance association strength. In this work, we assume equally
distributed association strength, which corresponds to averaging over all possible
combinations: $\varpi_{1,2} = \frac{1}{|N_1| \cdot |N_2|}$. In summary, the Extended OntoRand Index (EOR)
is identical to the original version in Brank et al. (2006) except that the $\delta$-function
is replaced by our $\Delta$.
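The set-based distance in (5) can be sketched as below. The node distance itself is not specified in this section, so the example plugs in a simple tree path length purely as a stand-in; that choice, the function names and the example tree are assumptions for illustration.

```python
from itertools import product

def set_distance(nodes_1, nodes_2, delta, weight=None):
    """Weighted sum of delta over all node pairs; defaults to the plain average."""
    pairs = list(product(nodes_1, nodes_2))
    if weight is None:
        # Equally distributed association strength: 1 / (|N1| * |N2|) for every pair.
        weight = lambda n1, n2: 1.0 / len(pairs)
    return sum(weight(n1, n2) * delta(n1, n2) for n1, n2 in pairs)

def tree_path_length(parent):
    """Stand-in node distance: number of edges between two nodes of a tree."""
    def chain(n):
        out = [n]
        while n in parent:
            n = parent[n]
            out.append(n)
        return out
    def delta(a, b):
        chain_a, chain_b = chain(a), chain(b)
        common = next(n for n in chain_a if n in chain_b)
        return chain_a.index(common) + chain_b.index(common)
    return delta

# Hypothetical tree: root -> A -> A1 and root -> B.
delta = tree_path_length({"A": "root", "B": "root", "A1": "A"})
# One instance assigned to both A1 and A, the other assigned to B only.
non_strict = set_distance({"A1", "A"}, {"B"}, delta)
strict = set_distance({"A"}, {"B"}, delta)
```

With single-element node sets the function reduces to the underlying node distance, which matches the strict case of the original index.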
4 Experiments and Conclusion

We performed several experiments to compare the proposed measures in terms of
their behavior with respect to different types of structural differences. For comparability,
we picked a scenario for which all three measures are applicable, i.e., a symmetric
comparison with strict instance assignment. Note, however, that the applicability
of the different measures varies and, depending on the task, not all measures might
be available. In order to avoid any kind of bias introduced by a real-world hierarchy,
we chose to create an artificial dataset as gold standard. It consisted of 190 nodes
with 50 instances assigned to each node (except root, making it a total of 9,450
instances). The branching factor varied between 2 and 3, the depth was 5.

Starting from our gold standard, we created six different experimental setups
by systematically inducing different structural errors. We then measured the
similarity between the original and the modified hierarchy with each of our proposed
measures. The results of these experiments are shown in Fig. 1. In setting (a), the
hierarchy was cut at a certain depth level. Instances assigned to nodes that were
removed were assigned to the closest ancestor node that remained in the hierarchy.
This scenario was already used by Brank et al. (2006) for their evaluation of the
OntoRand index. It simulates a top-down learning algorithm that stops its refinement too
early, e.g., because of an erroneous stopping criterion. In setting (b), the hierarchy
was made binary by introducing an empty intermediate node combining two
of the three child nodes wherever the branching factor was 3. The diagram ranges from
no change to replacing all 40 three-child nodes. This setting might occur, e.g., as an
artifact of a clustering procedure. In (c) and (d), nodes were removed on level
3 or on level 5, respectively. Their instances and child nodes were assigned to their parent
node. Both diagrams cover the same range, which equals the maximum number of nodes on
level 3. In (e) and (f), instances or nodes (with their instances but without their child
nodes) were moved randomly to a different place in the hierarchy. The diagrams range up
to about 50% of all instances/nodes moved.

Fig. 1 Experimental results: (a) hierarchy cutting, (b) inserting empty intermediate nodes, (c/d)
removing nodes on level 3/5, (e/f) moving instances/nodes. Recoverable axis labels: remaining
hierarchy level, number of inserted nodes, number of removed nodes. Recoverable legend entries:
ITP, ITR, EOR

Setting (a) clearly shows that the measures differ in their sensitivity to the
hierarchy levels. $H_2$ is least affected by the hierarchy level and therefore has the largest
drop in similarity as lower levels have more nodes. The other extreme is $H_1$, which
hardly loses similarity as long as the top level separation is not violated. The other
measures lie in between. Regarding ITP and ITR, only ITP reacts in this case.
ITR would cover the opposite case where the hierarchy would be split further than
required. Error type (b) cannot be measured with ITP/ITR at all as they ignore the
empty nodes. The peaks in the $H_1$ and EOR curves are an artifact of the level
sensitivity. A closer look at the data revealed that they occur when nodes were inserted at
high levels. Setups (c) and (d) further show this sensitivity. $H_2$ has an identical curve
on both levels. The other measures react more strongly on the higher level. In general, the
less sensitive a measure is to depth (unless it is not sensitive at all), the more clearly this
difference can be seen. In condition (e), all measures show a similar linear decrease,
except EOR, which seems to be less influenced by pure instance movement. An
interesting observation in (f) is the fast decrease of $H_1$, which showed much less
reaction in the other conditions. We attribute this to the fact that nodes were often
moved to completely different parts of the hierarchy, therefore introducing a
significant error. Furthermore, it is interesting that EOR, which also has a quite large
sensitivity to the hierarchy level, reacts the least. This shows that both measures have
a fundamentally different concept of measuring similarity. Summing up, we
rank our measures according to increasing level sensitivity: $H_2$, EOR, ITP/ITR, $H_1$.
However, this is not a sufficient distinction, as the measures showed highly different behavior.
There is no clearly best measure, and it is an individual decision which measure
best reflects the evaluation purpose. An important question to assess is: How many
specific errors count as much as a single general error? This question is not easy
to answer. However, increasing sensitivity leads to less smooth behavior of the
measure. This could be an argument against (too strong) level sensitivity.

Concluding, the contribution of this paper
is twofold: First, we assembled an interdisciplinary pool of evaluation methods for
learning algorithms of hierarchies, embedded in a generic framework. Second, we
proposed new instance-based measures for structural hierarchy comparison and
analyzed their properties experimentally. Our experiments describe the characteristics
of the measures and can be used as a basis for an individual decision according to the
specific evaluation need. The most obvious difference can be found in the sensitivity
to hierarchy levels. Using several measures should provide clear evidence of how and
where two hierarchies differ. Our methodology and results lay the groundwork for
further work in this direction, which we are also pursuing. Furthermore, it should
be noted that all measures suffer from the same problem as the originally proposed
Rand index. An adjustment as proposed for the Rand index in Hubert and Arabie
(1985) is necessary to allow comparison of results over different gold standards.
Abstract In clustering we often face the situation that only a subset of the
available attributes is relevant for forming clusters, even though this may not be known
beforehand. In such cases it is desirable to have a clustering algorithm that
automatically weights attributes or even selects a proper subset. In this paper I study such
an approach for fuzzy clustering, which is based on the idea to transfer an
alternative to the fuzzifier (Klawonn and Höppner, What is fuzzy about fuzzy clustering?
Understanding and improving the concept of the fuzzifier. In: Proc. 5th Int. Symp.
on Intelligent Data Analysis, 254-264, Springer, Berlin, 2003) to attribute weighting
fuzzy clustering (Keller and Klawonn, Int J Uncertain Fuzziness Knowl Based Syst
8:735-746, 2000). In addition, by reformulating Gustafson-Kessel fuzzy
clustering, a scheme for weighting and selecting principal axes can be obtained. While in
Borgelt (Feature weighting and feature selection in fuzzy clustering. In: Proc. 17th
IEEE Int. Conf. on Fuzzy Systems, IEEE Press, Piscataway, NJ, 2008) I already
presented such an approach for a global selection of attributes and principal axes,
this paper extends it to a cluster-specific selection, thus arriving at a fuzzy subspace
clustering algorithm (Parsons, Haque, and Liu, 2004).

Keywords Feature selection · Feature weighting · Fuzzifier alternative · Fuzzy
clustering · Subspace clustering.



1 Introduction 

A serious problem in distance-based clustering is that the more dimensions
(attributes) a data set has, the more the distances between data points - and thus also
the distances between data points and constructed cluster centers - tend to become
uniform. This, of course, impedes the effectiveness of clustering, as distance-based
clustering exploits that these distances differ. In addition, in practice often only a
subset of the available attributes is relevant for forming clusters, even though this
may not be known beforehand. In such cases it is desirable to have a clustering
algorithm that automatically weights the attributes or even selects a proper subset.

In general, there are three principles for feature selection in clustering. The
first is a filter approach (e.g., Dash, Choi, Scheuermann, & Liu, 2002; Jouve &
Nicoloyannis, 2005), which tries to assess and select features without any explicit
reference to the clustering algorithm to be employed. The second is a wrapper
approach (e.g., Dash & Liu, 2000; Dy & Brodley, 2000; Butterworth,
Piatetsky-Shapiro, & Simovici, 2005), which uses a clustering algorithm as an evaluator for
chosen feature subsets and may employ different search strategies for choosing the
subsets to evaluate. The final approach tries to combine clustering and feature
selection by pushing the feature selection method into the clustering algorithm (e.g., Roth
& Lange, 2004; Law, Figueiredo, & Jain, 2004). It should also be noted that any
feature weighting scheme (which may, in itself, employ any of these three principles)
can be turned into a feature selection method by simply applying a weight threshold
to the computed feature weights.
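The last remark, turning weights into a selection by thresholding, can be sketched in a few lines. The function name `select_features` and the example threshold are assumptions chosen for illustration; the text does not prescribe a particular threshold.

```python
def select_features(weights, threshold):
    """Return the indices of all features whose weight reaches the threshold."""
    return [k for k, w in enumerate(weights) if w >= threshold]


# Hypothetical weights for three features; the first one falls below the threshold.
selected = select_features([0.05, 0.60, 0.35], threshold=0.1)
```

The approach developed later in the paper avoids such a separate threshold by letting weights become exactly zero during clustering.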

In this paper I study weighting and selecting features in fuzzy clustering (Bezdek
& Pal, 1992; Bezdek, Keller, Krishnapuram, & Pal, 1999; Höppner, Klawonn,
Kruse, & Runkler, 1999; Borgelt, 2005; Wang, Wang, & Wang, 2004). The
core principle is to transfer the idea of an alternative to the fuzzifier (Klawonn
& Höppner, 2003) to attribute weighting fuzzy clustering (Keller & Klawonn,
2000). By reformulating Gustafson-Kessel fuzzy clustering (Gustafson & Kessel,
1979) this can even be extended to a scheme for weighting and selecting
principal axes. While the basics of this approach were already introduced in Borgelt
(2008) for global attribute weighting and selection, this paper extends this approach
to a cluster-specific operation. By carrying out experiments on artificial as well as
real-world data, I show that this approach works fairly well and may actually be
very useful in practice, even though the fact that it needs a normal fuzzy
clustering run for initialization (otherwise it is not sufficiently robust) still leaves room for
improvement.



2 Preliminaries and Notation 

Throughout this paper I assume that as input we are given an $m$-dimensional data set
$\mathbf{X}$ that consists of $n$ data points $\mathbf{x}_j = (x_{j1}, \ldots, x_{jm})$, $1 \le j \le n$. This data set may
also be seen as a data matrix $\mathbf{X} = (x_{jk})_{1 \le j \le n,\, 1 \le k \le m}$, the rows of which are the data
points. The objective is to group the data points into $c$ clusters, which are described
by $m$-dimensional cluster centers $\boldsymbol{\mu}_i = (\mu_{i1}, \ldots, \mu_{im})$, $1 \le i \le c$. These cluster
centers as well as the feature weights that will be derived (as they can be interpreted
as cluster shape and size parameters) are jointly denoted by the parameter set $\mathbf{C}$. The
(fuzzy) assignment of the data points to the cluster centers is described by a (fuzzy)
membership matrix $\mathbf{U} = (u_{ij})_{1 \le i \le c,\, 1 \le j \le n}$.



Fuzzy Subspace Clustering 






3 Attribute Weighting 

This section reviews two basic methods to compute attribute weights in fuzzy clus- 
tering. Its main purpose is to contrast these closely related methods and to set the 
stage for the attribute selection approach developed in this paper. 



3.1 Axes-Parallel Gustafson-Kessel Fuzzy Clustering 

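A very direct way to determine attribute weights is to apply axes-parallel
Gustafson-Kessel fuzzy clustering (Klawonn & Kruse, 1997). In this case we have to minimize
the objective function

\[
J(\mathbf{X}, \mathbf{C}, \mathbf{U}) = \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \sum_{k=1}^{m} \sigma_{ik}^{-2} (x_{jk} - \mu_{ik})^2
\]

subject to $\forall i; 1 \le i \le c: \prod_{k=1}^{m} \sigma_{ik}^{-2} = 1$ and the standard constraints
$\forall j; 1 \le j \le n: \sum_{i=1}^{c} u_{ij} = 1$ and $\forall i; 1 \le i \le c: \sum_{j=1}^{n} u_{ij} > 0$. The inverse
variances $\sigma_{ik}^{-2}$ are the
desired cluster-specific attribute weights, which have to be found by optimizing the
objective function. The membership transformation function $h$ is a convex function
on the unit interval. Usually $h(u_{ij}) = u_{ij}^{\alpha}$ with a user-specified fuzzifier $\alpha$ (most
often $\alpha = 2$) is chosen, but there are also other suggestions (for example Klawonn
& Höppner, 2003). As the methods discussed in this paper work with any choice of
the function $h$, its exact form will be left unspecified in the following. The resulting
update rules are

\[
\forall i; 1 \le i \le c: \forall j; 1 \le j \le n: \quad
u_{ij} = \frac{d_{ij}^{\frac{2}{1-\alpha}}}{\sum_{l=1}^{c} d_{lj}^{\frac{2}{1-\alpha}}}
\quad \text{if } h(u_{ij}) = u_{ij}^{\alpha},
\qquad \text{where} \quad
d_{ij}^2 = \sum_{k=1}^{m} \sigma_{ik}^{-2} (x_{jk} - \mu_{ik})^2,
\]

\[
\forall i; 1 \le i \le c: \quad
\boldsymbol{\mu}_i = \frac{\sum_{j=1}^{n} h(u_{ij})\, \mathbf{x}_j}{\sum_{j=1}^{n} h(u_{ij})},
\]

and

\[
\forall i; 1 \le i \le c: \forall k; 1 \le k \le m: \quad
\sigma_{ik}^{-2} = s_{ik}^{-2} \Bigl( \prod_{r=1}^{m} s_{ir}^{2} \Bigr)^{\frac{1}{m}},
\qquad \text{where} \quad
s_{ik}^2 = \sum_{j=1}^{n} h(u_{ij}) (x_{jk} - \mu_{ik})^2.
\]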
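To make the alternating scheme of this section concrete, here is a minimal pure-Python sketch of one update pass (memberships, then centers, then inverse variances) for the common choice $h(u) = u^{\alpha}$. The function name `gk_axes_parallel_step`, the list-of-lists data layout and the small clamp value `1e-12` (used to avoid division by zero) are assumptions of this sketch, not part of the method as published.

```python
from math import exp, log


def gk_axes_parallel_step(X, centers, inv_var, alpha=2.0):
    """One update pass of axes-parallel Gustafson-Kessel fuzzy clustering.

    X: n data points (lists of m floats), centers: c centers,
    inv_var: c lists of m inverse variances whose product is 1 per cluster.
    """
    n, m, c = len(X), len(X[0]), len(centers)
    # Squared distances d_ij^2 = sum_k sigma_ik^-2 (x_jk - mu_ik)^2 (clamped above 0).
    d2 = [[max(sum(inv_var[i][k] * (X[j][k] - centers[i][k]) ** 2
                   for k in range(m)), 1e-12)
           for j in range(n)] for i in range(c)]
    # Memberships: u_ij proportional to d_ij^(2/(1-alpha)), normalized per data point.
    raw = [[d2[i][j] ** (1.0 / (1.0 - alpha)) for j in range(n)] for i in range(c)]
    col = [sum(raw[i][j] for i in range(c)) for j in range(n)]
    u = [[raw[i][j] / col[j] for j in range(n)] for i in range(c)]
    h = [[u[i][j] ** alpha for j in range(n)] for i in range(c)]
    # Cluster centers as h-weighted means of the data points.
    h_sum = [sum(h[i]) for i in range(c)]
    new_centers = [[sum(h[i][j] * X[j][k] for j in range(n)) / h_sum[i]
                    for k in range(m)] for i in range(c)]
    # Inverse variances: sigma_ik^-2 = s_ik^-2 * (prod_r s_ir^2)^(1/m).
    new_inv_var = []
    for i in range(c):
        s2 = [max(sum(h[i][j] * (X[j][k] - new_centers[i][k]) ** 2
                      for j in range(n)), 1e-12) for k in range(m)]
        geo_mean = exp(sum(log(v) for v in s2) / m)  # geometric mean of the s_ik^2
        new_inv_var.append([geo_mean / v for v in s2])
    return u, new_centers, new_inv_var
```

Iterating this step until the parameters stop changing gives the usual alternating optimization; the geometric mean is computed via logarithms only to keep the product numerically stable.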
3.2 Attribute Weighting Fuzzy Clustering

An alternative, but equally simple scheme to obtain attribute weights was suggested
in Keller and Klawonn (2000). The objective function to minimize is

\[
J(\mathbf{X}, \mathbf{C}, \mathbf{U}) = \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \sum_{k=1}^{m} w_{ik}^{v} (x_{jk} - \mu_{ik})^2.
\]

The constraints on the membership degrees $u_{ij}$ are the same as for Gustafson-Kessel
fuzzy clustering, but the attribute weight constraint now reads $\forall i; 1 \le i \le c:
\sum_{k=1}^{m} w_{ik} = 1$. The additional parameter $v$ controls the influence of the attribute
weights in a similar way as the fuzzifier $\alpha$ (as in $h(u_{ij}) = u_{ij}^{\alpha}$) controls the influence
of the membership degrees. The update rules for membership degrees and cluster
centers coincide with those of Gustafson-Kessel fuzzy clustering. The weights are
updated according to

\[
\forall i; 1 \le i \le c: \forall k; 1 \le k \le m: \quad
w_{ik} = \frac{s_{ik}^{\frac{2}{1-v}}}{\sum_{r=1}^{m} s_{ir}^{\frac{2}{1-v}}}
\]

with $s_{ik}^2$ defined as in the preceding section. By rewriting the update rule of
Gustafson-Kessel fuzzy clustering (see Sect. 3.1) as

\[
\forall i; 1 \le i \le c: \forall k; 1 \le k \le m: \quad
\sigma_{ik}^{-2} = \frac{s_{ik}^{-2}}{\bigl( \prod_{r=1}^{m} s_{ir}^{-2} \bigr)^{\frac{1}{m}}}
\]

the similarities and differences become very obvious: they consist in a different
normalization (sum instead of product) and the additional parameter $v$.
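The contrast between the two normalizations can be seen directly on a small example. The following sketch computes both kinds of weights from given values $s_{ik}^2$ of a single cluster; the function names `gk_weights` and `kk_weights` are hypothetical labels introduced here for illustration.

```python
from math import prod


def gk_weights(s2):
    """Axes-parallel Gustafson-Kessel weights: s^-2 normalized to product 1."""
    inv = [1.0 / v for v in s2]
    norm = prod(inv) ** (1.0 / len(inv))  # geometric mean of the s_k^-2
    return [w / norm for w in inv]


def kk_weights(s2, v=2.0):
    """Keller-Klawonn weights: s^(2/(1-v)) normalized to sum 1."""
    raw = [x ** (1.0 / (1.0 - v)) for x in s2]  # (s^2)^(1/(1-v)) = s^(2/(1-v))
    total = sum(raw)
    return [w / total for w in raw]
```

For $s^2 = (1, 4)$ the first function yields weights with product 1 and the second (with $v = 2$) weights with sum 1; in both cases the attribute with the smaller spread receives the larger weight, and no weight can become zero.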



4 Attribute Selection 

The methods reviewed above yield attribute weights either as inverse variances
$\sigma_{ik}^{-2}$ or directly as weights $w_{ik}$, $1 \le i \le c$, $1 \le k \le m$. It is important to note that in both
cases it is impossible that any attribute weight vanishes. Therefore a modification
of the approach is necessary in order to select attributes (which may be achieved by
allowing attribute weights to become 0).

The core idea of the proposed attribute selection method is to transfer the analysis
of the effect of the fuzzifier $\alpha$ (as in $h(u_{ij}) = u_{ij}^{\alpha}$) and its possible alternatives, as it
was carried out in Klawonn and Höppner (2003), to attribute weights. As Klawonn
and Höppner (2003) showed, it is necessary to apply a convex function $h(\cdot)$ to the
membership degrees in order to achieve a fuzzy assignment. Raising the
membership degrees $u_{ij}$ to a user-specified power (namely the fuzzifier $\alpha$) is, of course, such
a convex function, but has the disadvantage that it forces all assignments to be fuzzy
(that is, to differ from 0 and 1). The reason is that the derivative of this function
vanishes at 0. If we want to maintain the possibility of crisp assignments, we rather
have to choose a function $h$ with $h'(0) > 0$.

With the approach of attribute weighting fuzzy clustering it becomes possible to
transfer this idea to the transformation of the attribute weights. That is, instead of
raising them to the power $v$ as in Keller and Klawonn (2000), we may transform
them by

\[
g(x) = \alpha x^2 + (1 - \alpha) x \quad \text{with} \quad \alpha \in (0, 1].
\]

The same function was suggested as an alternative transformation of the
membership degrees in Klawonn and Höppner (2003), and a fuzzy clustering algorithm was
derived that allowed for crisp (and thus in particular: vanishing) memberships in
case the distances of a data point to different clusters differed considerably. Here
the idea is that the same method applied to attribute weights should allow us to
derive a fuzzy clustering algorithm that assigns zero weights to some attributes,
thus effectively selecting attributes during the clustering process.

However, as was also discussed in Klawonn and Höppner (2003), the above
function has the disadvantage that its parameter $\alpha$ is difficult to interpret and thus difficult
to choose adequately. Fortunately, Klawonn and Höppner (2003) also provided a
better parameterization:

\[
g(x) = \frac{1 - \beta}{1 + \beta}\, x^2 + \frac{2\beta}{1 + \beta}\, x \quad \text{with} \quad \beta \in [0, 1).
\]

Generally, we now have to minimize the objective function

\[
J(\mathbf{X}, \mathbf{C}, \mathbf{U}) = \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \sum_{k=1}^{m} g(w_{ik}) (x_{jk} - \mu_{ik})^2
\]

subject to $\forall i; 1 \le i \le c: \sum_{k=1}^{m} w_{ik} = 1$ with $g(x) = \frac{1-\beta}{1+\beta} x^2 + \frac{2\beta}{1+\beta} x$ where $\beta \in [0, 1)$. The
constraints on the membership degrees (see Sect. 3.1), of course, also apply. This
leads to the update rule

\[
\forall i; 1 \le i \le c: \forall k; 1 \le k \le m: \quad
w_{ik} = \frac{1}{1 - \beta} \left( \frac{1 + \beta (m_i^{\oplus} - 1)}{\sum_{r=1;\, w_{ir} > 0}^{m} s_{ir}^{-2}}\; s_{ik}^{-2} - \beta \right)
\]

\[
\text{where} \quad
m_i^{\oplus} = \max \left\{ k \;\middle|\; s_{i\varsigma(k)}^{-2} > \frac{\beta}{1 + \beta (k - 1)} \sum_{r=1}^{k} s_{i\varsigma(r)}^{-2} \right\}.
\]

Here $\varsigma(\cdot)$ is a function that describes the permutation of the indices that sorts the
$s_{ik}^{-2}$ into descending order (that is, $s_{i\varsigma(1)}^{-2} \ge s_{i\varsigma(2)}^{-2} \ge \cdots$).
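The update rule with its sorting step is easier to follow as a procedure. The sketch below computes the weights of one cluster from its values $s_{ik}^2$. It assumes, as the condition $w_{ir} > 0$ in the rule suggests, that exactly the $m_i^{\oplus}$ attributes with the largest $s_{ik}^{-2}$ keep a positive weight and all others receive weight zero; the function name `select_weights` is hypothetical.

```python
def select_weights(s2, beta):
    """Attribute weights of one cluster from its s_k^2 values, beta in [0, 1)."""
    inv = [1.0 / v for v in s2]  # s_k^-2
    # Permutation that sorts the s_k^-2 into descending order.
    order = sorted(range(len(inv)), key=lambda k: inv[k], reverse=True)
    # m_plus: largest rank whose s^-2 still exceeds the rank-dependent threshold.
    m_plus, running = 0, 0.0
    for rank, k in enumerate(order, start=1):
        running += inv[k]  # sum of the 'rank' largest s^-2 values
        if inv[k] > beta / (1.0 + beta * (rank - 1)) * running:
            m_plus = rank
    active = order[:m_plus]
    total = sum(inv[k] for k in active)
    weights = [0.0] * len(inv)  # attributes outside the active set are deselected
    for k in active:
        weights[k] = ((1.0 + beta * (m_plus - 1)) / total * inv[k] - beta) / (1.0 - beta)
    return weights
```

With $\beta = 0$ the rule reduces to the sum-normalized weights of Sect. 3.2 for $v = 2$, since then $g(x) = x^2$; larger values of $\beta$ drive more weights to zero.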
5 Principal Axes Weighting

A standard problem of attribute weighting and selection approaches is that cor- 
related attributes will receive very similar weights or will both be selected, even 
though they are obviously redundant. In order to cope with this problem, an 
approach in the spirit of principal component analysis may be used: instead of 
weighting and selecting attributes, one may try to find (and weight) linear combina- 
tions of the attributes, and thus (principal) axes of the data set. This section shows 
how the methods of Sect. 3 can be extended to principal axes weighting by reformu- 
lating Gustafson-Kessel fuzzy clustering so that the specification of the (principal) 
axes and their weights is separated. 



5.1 Gustafson-Kessel Fuzzy Clustering 

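Standard Gustafson-Kessel fuzzy clustering uses a (cluster-specific) Mahalanobis
distance, which is based on cluster-specific covariance matrices $\boldsymbol{\Sigma}_i$, $i = 1, \ldots, c$.
The objective function is

\[
J(\mathbf{X}, \mathbf{C}, \mathbf{U}) = \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) (\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_j - \boldsymbol{\mu}_i),
\]

which is to be minimized subject to the constraints $\forall i; 1 \le i \le c: |\boldsymbol{\Sigma}_i^{-1}| = 1$
(intuitive interpretation: fixed cluster volume) and the standard constraints $\forall j;
1 \le j \le n: \sum_{i=1}^{c} u_{ij} = 1$ and $\forall i; 1 \le i \le c: \sum_{j=1}^{n} u_{ij} > 0$. The resulting
update rule for the covariance matrices $\boldsymbol{\Sigma}_i$ is

\[
\forall i; 1 \le i \le c: \quad
\boldsymbol{\Sigma}_i = \mathbf{S}_i\, |\mathbf{S}_i|^{-\frac{1}{m}}
\qquad \text{where} \quad
\mathbf{S}_i = \sum_{j=1}^{n} h(u_{ij}) (\mathbf{x}_j - \boldsymbol{\mu}_i)(\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top}.
\]

In order to obtain explicit weights for (principal) axes, we observe that, since the $\boldsymbol{\Sigma}_i$
are symmetric and positive definite matrices, they possess an eigenvalue
decomposition $\boldsymbol{\Sigma}_i = \mathbf{R}_i \mathbf{D}_i^2 \mathbf{R}_i^{\top}$ with $\mathbf{D}_i = \mathrm{diag}(\sigma_{i,1}, \ldots, \sigma_{i,m})$ (i.e., eigenvalues $\sigma_{i,1}^2$ to $\sigma_{i,m}^2$)
and orthogonal matrices $\mathbf{R}_i$, the columns of which are the corresponding
eigenvectors.¹ This enables us to write the inverse of a covariance matrix $\boldsymbol{\Sigma}_i$ as $\boldsymbol{\Sigma}_i^{-1} = \mathbf{T}_i \mathbf{T}_i^{\top}$
with $\mathbf{T}_i = \mathbf{R}_i \mathbf{D}_i^{-1}$. As a consequence, we can rewrite the objective function as

\[
\begin{aligned}
J(\mathbf{X}, \mathbf{C}, \mathbf{U})
&= \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) (\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top} \mathbf{T}_i \mathbf{T}_i^{\top} (\mathbf{x}_j - \boldsymbol{\mu}_i) \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \bigl( (\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top} \mathbf{R}_i \mathbf{D}_i^{-1} \bigr) \bigl( (\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top} \mathbf{R}_i \mathbf{D}_i^{-1} \bigr)^{\top} \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \sum_{k=1}^{m} \sigma_{ik}^{-2} \Bigl( \sum_{l=1}^{m} (x_{jl} - \mu_{il})\, r_{i,lk} \Bigr)^{2}.
\end{aligned}
\]

¹ Note that the eigenvalues of a symmetric and positive definite matrix are all positive and thus it
is possible to write them as squares.

In this form the scaling and the rotation of the data space that are encoded in
the covariance matrices $\boldsymbol{\Sigma}_i$ are nicely separated: the former is represented by the
variances $\sigma_{ik}^2$, $k = 1, \ldots, m$ (or their inverses $\sigma_{ik}^{-2}$), the latter by the orthogonal
matrices $\mathbf{R}_i$. In other words: the inverse variances $\sigma_{ik}^{-2}$ (the eigenvalues of $\boldsymbol{\Sigma}_i^{-1}$)
provide the desired axes weights, while the corresponding eigenvectors (the columns
of $\mathbf{R}_i$) indicate the (principal) axes.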
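As an added consistency check (a short worked remark, not taken from the original derivation), the factorization of the inverse covariance matrix and the link to the product constraint of Sect. 3.1 follow directly from the orthogonality of $\mathbf{R}_i$ and the fact that $\mathbf{D}_i$ is diagonal:

```latex
\begin{align*}
\mathbf{T}_i \mathbf{T}_i^{\top}
  &= \mathbf{R}_i \mathbf{D}_i^{-1} \bigl(\mathbf{R}_i \mathbf{D}_i^{-1}\bigr)^{\top}
   = \mathbf{R}_i \mathbf{D}_i^{-2} \mathbf{R}_i^{\top}, \\
\boldsymbol{\Sigma}_i \bigl(\mathbf{R}_i \mathbf{D}_i^{-2} \mathbf{R}_i^{\top}\bigr)
  &= \mathbf{R}_i \mathbf{D}_i^{2} \mathbf{R}_i^{\top} \mathbf{R}_i \mathbf{D}_i^{-2} \mathbf{R}_i^{\top}
   = \mathbf{R}_i \mathbf{D}_i^{2} \mathbf{D}_i^{-2} \mathbf{R}_i^{\top}
   = \mathbf{R}_i \mathbf{R}_i^{\top} = \mathbf{1}, \\
|\boldsymbol{\Sigma}_i^{-1}|
  &= |\mathbf{R}_i|\, |\mathbf{D}_i^{-2}|\, |\mathbf{R}_i^{\top}|
   = \prod_{k=1}^{m} \sigma_{ik}^{-2}.
\end{align*}
```

The last line shows that the unit determinant constraint on $\boldsymbol{\Sigma}_i^{-1}$ is the same product constraint on the inverse variances that appears in axes-parallel Gustafson-Kessel fuzzy clustering, only expressed in the rotated coordinate system.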



5.2 Reformulation of Gustafson-Kessel Fuzzy Clustering 



In order to transfer the approach of Keller and Klawonn (2000) and the one
developed in Sect. 4, we start from the rewritten objective function, in which the scaling
and the rotation of the data space are separated and thus can be treated
independently. Deriving the update rule for the scaling factors $\sigma_{ik}^{-2}$ is trivial, since basically
the same result is obtained as for axes-parallel Gustafson-Kessel fuzzy clustering
(see Sect. 3.1), namely

\[
\sigma_{ik}^{-2} = s_{ik}^{-2} \Bigl( \prod_{r=1}^{m} s_{ir}^{2} \Bigr)^{\frac{1}{m}},
\]

with the only difference that now we have

\[
s_{ik}^2 = \sum_{j=1}^{n} h(u_{ij}) \Bigl( \sum_{l=1}^{m} (x_{jl} - \mu_{il})\, r_{i,lk} \Bigr)^{2}.
\]

Note that this update rule reduces to the update rule for axes-parallel
Gustafson-Kessel clustering derived in Sect. 3.1 if $\mathbf{R}_i = \mathbf{1}$ (where $\mathbf{1}$ is an $m \times m$ unit matrix),
which provides a simple sanity check of this rule.

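In order to derive an update rule for the orthogonal matrix $\mathbf{R}_i$, we have to take
into account that in contrast to how the covariance matrix $\boldsymbol{\Sigma}_i$ is treated in normal
Gustafson-Kessel fuzzy clustering, there is an additional constraint, namely that
$\mathbf{R}_i$ must be orthogonal, that is, $\mathbf{R}_i^{\top} = \mathbf{R}_i^{-1}$. This constraint can conveniently be
expressed by requiring $\mathbf{R}_i \mathbf{R}_i^{\top} = \mathbf{1}$. Incorporating this constraint² into the objective
function yields the Lagrange functional (see Golub and Van Loan (1996) for the
second term)

\[
\begin{aligned}
\mathcal{L}(\mathbf{X}, \mathbf{C}, \mathbf{U}, \mathbf{L})
&= \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \bigl( (\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top} \mathbf{R}_i \mathbf{D}_i^{-1} \bigr) \bigl( (\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top} \mathbf{R}_i \mathbf{D}_i^{-1} \bigr)^{\top} \\
&\quad + \sum_{i=1}^{c} \mathrm{trace}\bigl( \boldsymbol{\Lambda}_i (\mathbf{1} - \mathbf{R}_i \mathbf{R}_i^{\top}) \bigr),
\end{aligned}
\]

where $\mathbf{L} = \{\boldsymbol{\Lambda}_1, \ldots, \boldsymbol{\Lambda}_c\}$ is a set of symmetric $m \times m$ matrices of Lagrange
multipliers and $\mathrm{trace}(\cdot)$ is the trace operator, which for an $m \times m$ matrix $\mathbf{M}$ is
defined as $\mathrm{trace}(\mathbf{M}) = \sum_{k=1}^{m} m_{kk}$. The resulting update rule³ for the rotation
matrices is $\mathbf{R}_i = \mathbf{O}_i$, where $\mathbf{O}_i$ is derived from the eigenvalue decomposition
of $\mathbf{S}_i = \sum_{j=1}^{n} h(u_{ij}) (\mathbf{x}_j - \boldsymbol{\mu}_i)(\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top}$, that is, from $\mathbf{S}_i = \mathbf{O}_i \mathbf{E}_i^2 \mathbf{O}_i^{\top}$ where
$\mathbf{E}_i = \mathrm{diag}(e_{i,1}, \ldots, e_{i,m})$ is a diagonal matrix containing the eigenvalues.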
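For intuition, the rotation update can be traced by hand in two dimensions, where the eigenvectors of a symmetric matrix have a closed form. The sketch below is restricted to $m = 2$ as a simplifying assumption; for general $m$ one would use a numerical eigenvalue routine instead. The function name `rotation_update_2d` is hypothetical, and `h` stands for the already transformed membership degrees $h(u_{ij})$ of one cluster.

```python
from math import atan2, cos, sin


def rotation_update_2d(X, center, h):
    """Rotation matrix R_i and rotated scatter values s_ik^2 for one cluster, m = 2."""
    # Weighted scatter matrix S_i = sum_j h_j (x_j - mu)(x_j - mu)^T = [[a, b], [b, d]].
    a = b = d = 0.0
    for (x, y), w in zip(X, h):
        dx, dy = x - center[0], y - center[1]
        a += w * dx * dx
        b += w * dx * dy
        d += w * dy * dy
    # Closed-form eigenvectors of a symmetric 2x2 matrix; R is a proper rotation (det = 1).
    theta = 0.5 * atan2(2.0 * b, a - d)
    R = [[cos(theta), -sin(theta)],
         [sin(theta), cos(theta)]]
    # s_ik^2 = sum_j h_j (sum_l (x_jl - mu_l) r_lk)^2 for each rotated axis k.
    s2 = [0.0, 0.0]
    for (x, y), w in zip(X, h):
        dx, dy = x - center[0], y - center[1]
        for k in range(2):
            s2[k] += w * (dx * R[0][k] + dy * R[1][k]) ** 2
    return R, s2, (a, b, d)
```

Because the columns of the rotation matrix are eigenvectors of the scatter matrix, the values $s_{ik}^2$ computed in the rotated system coincide with its eigenvalues, so the weight update of this section can reuse them directly.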



6 Principal Axes Selection 

In analogy to the transition from attribute weighting (Sect. 3) to attribute
selection (Sect. 4), it is possible to make the transition from (principal) axes weighting
(Sect. 5) to (principal) axes selection (this section): we simply replace the update
rule for the weights (which are now separate from the axes) with the one obtained
in Sect. 4. This leads to the update rule

\[
\forall i; 1 \le i \le c: \forall k; 1 \le k \le m: \quad
w_{ik} = \frac{1}{1 - \beta} \left( \frac{1 + \beta (m_i^{\oplus} - 1)}{\sum_{r=1;\, w_{ir} > 0}^{m} s_{ir}^{-2}}\; s_{ik}^{-2} - \beta \right)
\]

with $s_{ik}^2$ defined as in Sect. 5.2 and $m_i^{\oplus}$ as defined in Sect. 4.



7 Experiments 



Of all experiments I conducted with the described method on various data sets, I
report only a few here, due to limitations of space. Since experimental results for a
global weighting and selection of attributes can be found in Borgelt (2008), I confine
myself here to cluster-specific attribute weighting and selection.



² Note that, in principle, the orthogonality constraint alone is not enough as it is compatible with
$|\mathbf{R}_i| = -1$, while we need $|\mathbf{R}_i| = 1$. However, the unit determinant constraint is automatically
satisfied by the solution and thus we can avoid incorporating it.

³ Note that this rule satisfies $|\mathbf{R}_i| = 1$ as claimed in the preceding footnote.

Fig. 1 Artificial data set with three Gaussian clusters and 300 data points

Fig. 2 Artificial data set with two Gaussian clusters and 200 data points

Fig. 3 The wine data set (Blake & Merz, 1998), a real-world data set with three classes and 178
data points. The diagram shows attributes 7, 10 and 13. $\beta = 0.5$ selects attributes 2, 10, and
13 (one attribute per cluster). $\beta = 0.3$ selects the attribute sets {7, 10, 12}, {6, 7, 12, 13}, and {2}.
Clustering the subspace spanned by attributes 7, 10 and 13 yields the result summarized in the
figure (the accompanying table is not recoverable; surviving fragments: $\beta = 0.3$, $w_{2,k}$, 1.00)

Figures 1, 2 and 3 show two artificial and one real-world data set (Blake &
Merz, 1998) and the clustering results obtained on them. In all three cases the
algorithm was initialized by axes-parallel Gustafson-Kessel fuzzy clustering (see
Sect. 3.1), which was run until convergence. (Without such an initialization the
results were not quite stable.) As can be seen from these results, the method is
promising and may actually be very useful in practice. In all three cases
uninformative attributes were nicely removed (received weights of zero or coefficients close
to zero), while the informative attributes received high weights, which nicely reflect
the structure of the data set.



8 Summary 

In this paper I introduced a method for selecting attributes in fuzzy clustering that 
is based on the idea to transfer an alternative to the fuzzifier, which controls the 
influence of the membership degrees, to attribute weights. This allows the attribute 
weights to vanish and thus effectively selects and weights attributes at the same 
time. In addition, a reformulation of Gustafson-Kessel fuzzy clustering separates 
the weights and the directions of the principal axes, thus making it possible to extend 
the scheme to a weighting and selection of principal axes, which helps when deal- 
ing with correlated attributes. Using this scheme in a cluster-specific fashion yields 
a fuzzy subspace clustering approach, in which each cluster is formed in its own 
particular subspace. 



Software 

The program used for the experiments as well as its source code can be retrieved 
free of charge under the GNU Lesser (Library) Public License at 

http://www.borgelt.net/cluster.html 
Motif-Based Classification of Time Series
with Bayesian Networks and SVMs



Krisztian Buza and Lars Schmidt-Thieme 



Abstract Classification of time series is an important task with many
challenging applications like brain wave (EEG) analysis, signature verification or speech
recognition. In this paper we show how characteristic local patterns (motifs) can
improve the classification accuracy. We introduce a new motif class, generalized
semi-continuous motifs. To allow flexibility and noise robustness, these motifs may
include gaps of various lengths as well as generic and more specific wildcards. We propose
an efficient algorithm for mining generalized sequential motifs. In experiments on
real medical data, we show how generalized semi-continuous motifs improve the
accuracy of SVMs and Bayesian Networks for time series classification.

Keywords Bayesian networks · Motifs · SVM · Time series.



1 Introduction 

Many quantities change continuously over time, like the blood pressure
or body temperature of a person, exchange rates of currencies, the speed of a
car, etc. Making observations regularly results in a sequence of measured values
(usually real numbers). We call such a sequence a time series.

We illustrate the classification of time series with an example. Suppose we are trading
with currencies, and we know the past exchange rates for some currencies.
We are interested in the exchange rates in the future. Now we can group
currencies into three classes based on how the exchange rate changes in the near future
(next month): (1) the exchange rate increases by at least 5%, (2) the exchange rate
decreases by at least 5%, (3) the exchange rate does not change significantly. Based
on the time series representing exchange rates in the past, we would like to predict
exchange rates for the next month, i.e., we want to classify time series of currencies
into one of the previously defined classes.
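The three classes of this example can be written down as a small labeling function, which is how class labels for training data could be derived once the future rate is known. This is only an illustrative sketch: the function name `label_exchange_rate` and the use of a relative change with respect to the current rate are assumptions, and the classifier itself (which must predict the label from past values alone) is not shown.

```python
def label_exchange_rate(current_rate, future_rate, threshold=0.05):
    """Return class 1 (increase), 2 (decrease) or 3 (no significant change)."""
    change = (future_rate - current_rate) / current_rate  # relative change
    if change >= threshold:
        return 1
    if change <= -threshold:
        return 2
    return 3
```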
In general, classification of time series is an important task related to many
challenging practical problems like indexing of handwritten documents
(Ratanamahatana & Keogh, 2004b; Manmatha & Rath, 2003), speech recognition
(Sakoe & Chiba, 1978), signature verification (Gruber, Coduro, & Sick, 2006),
analysis of brain wave (EEG) signals (Marcel & Millan, 2007) or medical diagnosis
(Knorr, 2006b).

In this paper we focus on the classification of time series based on recurrent
patterns, called motifs. We show that time series classification models based on
SVMs and Bayesian Networks can be improved using motifs. As the main
contribution we introduce a new motif class: a common generalization of continuous and
non-continuous sequential motifs, called generalized semi-continuous motifs. We
propose an efficient algorithm for mining generalized sequential motifs. We
compare generalized semi-continuous motifs to existing motif classes, and show that
generalized semi-continuous motifs outperform other classes of motifs in the time
series classification task.



2 Related Work 



Motif discovery. The task of motif (or pattern) discovery in time series is
understood in slightly different ways in the literature. Yankov, Keogh, Medina, Chiu, and
Zordan (2007) and Patel, Keogh, Lin, and Lonardi (2002) define the task of motif
discovery regarding one "long" time series: the target of their work is to identify
recurrent patterns, i.e., approximately repeated parts of the given time series. In
contrast, Futschik and Carlisle (2005) are concerned with a set of time series. They
apply global patterns: they cluster time series, and calculate the "compromise" time
series for each cluster. Such a "compromise" time series is regarded as a
representative pattern of the time series in the cluster. Jensen, Styczynski, Rigoutsos, and
Stephanopoulos (2006) and Ferreira, Azevedo, Silva, and Brito (2006) also use
clustering, however in a more local fashion: they do not cluster the whole sequences, but
subsequences of them.

Predefining a (minimal) length L for motifs, scanning the database, and
enumerating (almost) all the subsequences of the given length L is common in the
biological domain (Yankov et al., 2007; Ferreira et al., 2006; Jensen et al., 2006;
Patel et al., 2002). However, this may not be efficient enough, especially if complex
motifs with gaps and/or taxonomical wildcards are to be discovered (see Fig. 1).
For noise-robust motif detection (without wildcards) Buhler and Tompa (2002) use
random projections. A more sophisticated solution is based on the antimonotonicity
originally observed in the frequent itemset mining community (Agrawal & Srikant,
1994), namely, that subpatterns of a frequent pattern are also frequent.
State-of-the-art frequent sequence mining algorithms are based on antimonotonicity (Gaul
& Schmidt-Thieme, 2001; Bodon, 2005). It suggests, roughly speaking, that one
discovers "short" motifs first and then "grows" them step by step into
longer ones. Approaches based on antimonotonicity avoid processing many
redundant (i.e., non-motif) subsequences. They have been researched intensively,
resulting in highly efficient implementations (Borgelt, 2003, 2004). For databases
of different character there are special algorithms: for example, if many
short motifs are expected, they can efficiently be discovered ("grown together")
in a breadth-first search manner as in Bodon (2005); if long motifs are expected,
depth-first search approaches such as PrefixSpan (Pei, Han,
Wang, Pinto, Chen, et al., 2004) are preferable.

Fig. 1 Specific and generic wildcards build a taxonomy of symbols

In this work, we mean by motifs approximately repeated local patterns. We pro- 
pose an approach based on pattern mining techniques in order to discover motifs 
regarding a set of time series. 

As usual in time series motif detection (Knorr, 2006a, 2006b; Patel et al., 2002),
as a preprocessing step, we turn time series into a sequence of discrete symbols using
Symbolic Aggregate Approximation (Lin, Keogh, Lonardi, & Chiu, 2003). In our
experiments we use 10 (exp. 1) and 7 (exp. 2) different symbols, and we aggregate on
an interval of length 4.
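A minimal sketch of this preprocessing step, following the usual description of Symbolic Aggregate Approximation (z-normalization, averaging over fixed-length intervals, discretization with equiprobable breakpoints of the standard normal distribution), is given below. The function name `sax`, the letter alphabet and the handling of a shorter final interval are assumptions of this sketch; the exact settings of the experiments beyond the alphabet sizes and the interval length are not stated in the text.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, alphabet_size=7, segment_length=4):
    """Turn a numeric series into a string over the symbols 'a', 'b', ..."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0 for _ in series]  # constant series: all values at the mean
    else:
        normalized = [(v - mu) / sigma for v in series]
    # Piecewise aggregation: mean of each block of segment_length values
    # (the last block may be shorter if the length is not a multiple).
    segments = [mean(normalized[i:i + segment_length])
                for i in range(0, len(normalized), segment_length)]
    # Breakpoints cutting the standard normal distribution into equiprobable regions.
    std_normal = NormalDist()
    breakpoints = [std_normal.inv_cdf(q / alphabet_size)
                   for q in range(1, alphabet_size)]
    # The symbol index is the number of breakpoints below the segment mean.
    return "".join(chr(ord("a") + sum(v > b for b in breakpoints)) for v in segments)
```

With an interval length of 4, a time series of 16 values is reduced to a string of 4 symbols, on which the pattern mining techniques discussed here can operate.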

According to this representation, different classes of motifs can be defined
regarding generality and character. Regarding generality, we distinguish (1) flat
patterns (without wildcards), and (2) patterns with taxonomical wildcards (see Fig. 1),
called generalized patterns. Regarding character we distinguish between (1) Set
Motifs (the order of symbols is omitted), (2) Sequential Motifs (continuous and
non-continuous ones), and (3) Semi-continuous Sequential Motifs, which are a
common generalization of continuous and non-continuous motifs: in semi-continuous
motifs at most n gaps, each of a maximal length d, are allowed. (For d = n = 0
semi-continuous motifs are identical to continuous motifs; for d = n = ∞ they
are the same as non-continuous motifs.) Table 1 provides a systematic overview of
selected works on pattern mining.

For the task of discovering flat patterns, optimization techniques (recursive
counting and recursion pruning) were introduced in Borgelt (2003, 2004). To the best
of our knowledge, they have not yet been generalized to patterns with taxonomic
wildcards. Ferreira and Azevedo (2005) already allowed gaps in motifs, but without
taxonomical wildcards. They discovered motifs by enumerating all subsequences
(of a given length) rather than by an algorithm that exploits pattern mining
techniques.

Motif-based classification. Motifs have been used for sequence classification in
the biological domain (Dzeroski, Slavkov, Gjorgjioski, & Struyf, 2006; Kunik, Solan,
Edelman, Ruppin, & Horn, 2005; Ferreira & Azevedo, 2005). This is usually done
in two steps: (1) motifs are extracted, then (2) each time series is represented
as an attribute vector using motifs so that a classifier like SVM (Kunik et al.),
Naive Bayes (Ferreira & Azevedo), Decision Tree (Dzeroski et al.), etc. can be
applied. Possible ways of constructing the attributes are: (1) a binary attribute
for each motif indicates whether the motif is contained in the time series or not
(Dzeroski et al.; Knorr; Kunik et al.) (see Fig. 2); (2) aggregating attributes
indicate the total count and/or average length of the motifs occurring in a time
series (Ferreira & Azevedo).

Table 1 Systematic overview of selected pattern mining related work

                          Flat patterns                 With taxonomy

Set motifs                Agrawal and Srikant (1994)    Hipp, Myka, Wirth, and Güntzer (1998)
(item sets)               Borgelt (2003, 2004)          Pramudiono and Kitsuregawa (2004)
                                                        Sriphaew and Theeramunkong (2002, 2004)

Sequential motifs         Bodon (2005)                  Gaul and Schmidt-Thieme (2001)
(continuous, non-cont.)                                 Srikant and Agrawal (1996)

Semi-continuous           Ferreira and Azevedo (2005)   This work
sequential motifs

Fig. 2 Representation of time series as an attribute vector using motif features

Time series classification. As a baseline in our experiments we have chosen the
Nearest Neighbour approach with Dynamic Time Warping (DTW) as distance measure.
DTW is basically an edit distance that allows stretching of time series. It was
introduced in Sakoe and Chiba (1978). Recent works make DTW more accurate
(Ratanamahatana & Keogh, 2004b) and speed it up for data mining applications
(Keogh & Pazzani, 2000). It has also been studied from a theoretical-empirical
point of view (Ratanamahatana & Keogh, 2004a). Furthermore, some recent work
suggests that DTW is the best solution for some time series classification
problems (Rath & Manmatha, 2003). Thus, DTW can be regarded as state of the art.
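
To make the baseline concrete, the following Python sketch shows the textbook dynamic
programming formulation of DTW together with a 1-nearest-neighbour rule. It is a
minimal sketch, not the baseline code used in the experiments: the absolute difference
as local cost, the absence of a warping window, and the names `dtw_distance` and
`nearest_neighbour_label` are assumptions.

```python
def dtw_distance(a, b):
    """Accumulated cost of the cheapest warping path between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # b[j-1] is aligned with one more element of a
                cost[i][j - 1],      # a[i-1] is aligned with one more element of b
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def nearest_neighbour_label(query, training):
    """training: list of (series, label) pairs; returns the label of the DTW-nearest series."""
    return min(training, key=lambda pair: dtw_distance(query, pair[0]))[1]
```

Because one element may be aligned with several elements of the other sequence, a
series and a locally stretched copy of it have distance zero.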



3 Discovery of Generalized Semi-Continuous Motifs

To solve the time series classification task, we first search for motifs in the time
series. In this section we describe our motif discovery approach in detail. We
assume that the time series have been converted to sequences of discrete symbols.
Motif discovery then means finding frequent subsequences in the dataset of symbolic
time series.

Definitions. Given are a database of time series $D$, a set of symbols $\Sigma$, a
taxonomic relation $T_\Sigma$ over $\Sigma$, the maximal number of gaps $n$, the maximal
length of gaps $d$, and a minimum support threshold $s$. A symbol $x \in \Sigma$ matches
another symbol $y \in \Sigma$ if either $x = y$ or $y$ is a descendant of $x$ according to
$T_\Sigma$; $x$ is called the matching symbol, $y$ the matched symbol. A sequence of symbols
$m$ semi-continuously matches a time series $t \in D$ if every symbol of $m$ matches a
symbol in $t$ so that (1) the order of the matched symbols is the same as the order of
the matching symbols, and (2) the matched symbols in $t$ form a continuous sequence
except for at most $n$ gaps, each of length at most $d$. A sequence of symbols $m$ is
called a semi-continuous motif if it matches at least $s$ time series in $D$. The number
of matched time series is called the support of $m$.
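
The definitions can be made concrete with a small Python sketch. It is one possible
reading of semi-continuous matching, not the trie-based implementation described
below: the taxonomy is assumed to be given as a child-to-parent dictionary, a gap is
counted whenever at least one series symbol is skipped between two consecutive matched
symbols, and all function names are hypothetical.

```python
def matches(pattern_symbol, series_symbol, parent):
    """True if series_symbol equals pattern_symbol or is one of its descendants."""
    node = series_symbol
    while node is not None:
        if node == pattern_symbol:
            return True
        node = parent.get(node)  # climb one level; None above a root
    return False


def semi_continuous_match(pattern, series, parent, n, d):
    """True if pattern occurs in series with at most n gaps, each skipping at most d symbols."""
    if not pattern:
        return True

    def extend(p_idx, pos, gaps_used):
        # p_idx: next pattern symbol to place; pos: first unused series position
        if p_idx == len(pattern):
            return True
        # the next symbol follows directly (skip 0) or after a gap of 1..d symbols
        max_skip = d if gaps_used < n else 0
        for skip in range(max_skip + 1):
            i = pos + skip
            if i >= len(series):
                break
            if matches(pattern[p_idx], series[i], parent):
                if extend(p_idx + 1, i + 1, gaps_used + (skip > 0)):
                    return True
        return False

    # the first pattern symbol may be anchored at any position of the series
    return any(matches(pattern[0], series[start], parent) and extend(1, start + 1, 0)
               for start in range(len(series)))


def support(pattern, database, parent, n, d):
    """Number of time series in the database matched by the pattern."""
    return sum(semi_continuous_match(pattern, t, parent, n, d) for t in database)
```

For example, with `parent = {"a": "low", "b": "low", "c": "high", "d": "high"}` the
generalized pattern `["low", "high"]` matches both `["a", "c", "b", "d"]` and
`["b", "d"]` without any gap.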

Checking the support of every possible sequence to decide whether it is a motif is
not feasible: the number of possible sequences is so large that the computational
cost becomes prohibitive. Thus, we need to prune the search space in order to reduce
the number of sequences to be checked, and we need an efficient implementation for
checking supports. We adapt and combine the constraints in Agrawal and Srikant (1994)
and in Sriphaew and Theeramunkong (2002, 2004) for generalized semi-continuous
motifs, and we extend the optimization ideas in Borgelt (2003, 2004) to generalized
semi-continuous motifs.

The basic intuition behind the constraints on generalized semi-continuous motifs
is the antimonotone property of the support function: (1) if a sequence $p$ includes
another sequence $p'$, the support of $p$ is less than or equal to the support of $p'$;
(2) if a sequence $p$ is less general than some other sequence $p'$, the support of $p$
is less than or equal to the support of $p'$.

These conditions hold because every time series $t$ matching $p$ also matches $p'$, so
the number of time series matching $p$ cannot be higher than the number of time series
matching $p'$. The following formalization of these constraints defines exactly what is
meant by inclusion and generalization. Both constraints are consequences of the
definition of the support (and of semi-continuous matching). Let $p$ be a sequence over
$\Sigma$ (possibly including taxonomic wildcards): $p = (w_1, w_2, \ldots, w_{k-1}, w_k)$,
each $w_i \in \Sigma$, $0 < i \le k$.

Constraint 1. Let $p'$ be a subsequence of $p$: $p' = (w_j, w_{j+1}, w_{j+2}, \ldots, w_{j+l})$,
$0 < j < j + l \le k$. In this case $\mathrm{support}(p) \le \mathrm{support}(p')$.

Constraint 2. Denote the transitive closure of the taxonomic relation $T_\Sigma$ by
$T_\Sigma^{+}$ (i.e., $(x, y) \in T_\Sigma^{+}$ means that $x$ is a descendant of $y$ in the
taxonomy). Suppose $p'$ is a more general sequence than $p$: $p' = (w'_1, w'_2, \ldots, w'_k)$
with $\forall i: (w_i, w'_i) \in T_\Sigma^{+}$, $0 < i \le k$. In this case
$\mathrm{support}(p) \le \mathrm{support}(p')$.

These constraints suggest checking shorter and more general sequences first
for being motifs. For example, given the taxonomy in Fig. 3, the pattern
(G, H) would be checked before (G, w) and (G, H, H).

Fig. 3 (a) An example taxonomy with two roots. (b) A simplified overview of the data structure
for counting candidates and storing motifs. Straight arrows denote sequential children, lines denote
taxonomic children. One of the cross-pointers is depicted with a dotted arrow. Curved arrows show
the recursion steps in the doubly recursive search scheme

For motif mining we use a significantly extended version of the Apriori algorithm
(Agrawal & Srikant, 1994). The Apriori algorithm essentially iterates over three
steps: (1) Candidate generation: Based on the motifs found in the previous iterations,
some new sequences are chosen to be checked for support. These are the candidates.
(2) Support counting: The support of each candidate is determined. (3) Filtering of
infrequent candidates: The candidates with less support than the given threshold s
are deleted. The remaining ones are motifs.
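
The three steps can be sketched in Python for the simplest special case, namely flat
continuous motifs (no taxonomy, no gaps). This is a deliberately reduced illustration
and not the extended algorithm of this work; the function names, the absolute support
threshold, and the naive support counting by scanning each series are assumptions.

```python
def contains(series, candidate):
    """True if candidate (a tuple of symbols) occurs as a contiguous run in series."""
    k = len(candidate)
    return any(tuple(series[i:i + k]) == candidate for i in range(len(series) - k + 1))


def apriori_motifs(database, min_support, max_iterations=10):
    """Return a dict mapping each frequent continuous motif to its support."""
    alphabet = sorted({sym for series in database for sym in series})
    candidates = [(sym,) for sym in alphabet]  # first iteration: single symbols
    motifs = {}
    for _ in range(max_iterations):
        if not candidates:
            break
        # (2) support counting: number of series that contain the candidate
        counts = {c: sum(contains(t, c) for t in database) for c in candidates}
        # (3) filtering: keep candidates that reach the threshold
        frequent = {c: s for c, s in counts.items() if s >= min_support}
        motifs.update(frequent)
        # (1) candidate generation for the next iteration: extend a motif p by one
        # symbol only if the extension without its first symbol is itself a motif
        candidates = [p + (sym,) for p in frequent for sym in alphabet
                      if p[1:] + (sym,) in motifs]
    return motifs


found = apriori_motifs([list("abcab"), list("abd"), list("cab")], min_support=2)
```

The `max_iterations` argument bounds the motif length, which corresponds to limiting
the number of Apriori iterations in order to obtain local motifs.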

The computational cost of Apriori highly depends on the data structure used.
Tries have been shown to be efficient (see Borgelt, 2003). In a trie, a path from the
root to a node encodes a sequence. A simplified view of our data structure is shown in
Fig. 3. For example, the path (root, G, J, H, w) encodes the sequence (J, w). There
are two different types of edges in this path: taxonomical and sequential edges
(straight lines and straight arrows, respectively). The sequential edges are (root, G)
and (J, H). Each sequential edge denotes a new symbol of the sequence. The taxonomical
edges in the path specialize the symbols. In this path G was specialized to J, and H
was specialized to w.

There are also "cross-pointers" in the data structure, pointing from a sequence
prefix $(w_1, w_2, \ldots, w_k)$ to the sequence prefix $(w_2, w_3, \ldots, w_k)$. To keep the
example simple, only one cross-pointer is depicted (dotted arrow). These cross-pointers
allow quick candidate generation.

Candidate generation: Let $C_r$ denote the number of roots in the taxonomy $T_\Sigma$. At
the beginning of the first iteration there are $C_r$ candidates, one for each root. These
are the most general and shortest sequences (they consist of one item). After counting
the support, the candidates for the next iteration are always derived from the motifs
already found. As an application of Constraint 2, a motif $p$ of length 1 may have any
concretizing extension $p'$ that conforms to the given taxonomy. For longer motifs:
knowing that the sequence $p = (w_1, w_2, \ldots, w_{k-1}, w_k)$ is a motif, its concretizing
extension $p' = (w_1, w_2, \ldots, w_{k-1}, w'_k)$, where $w'_k$ is a descendant of $w_k$ in the
taxonomy, may only be a motif if $p_t = (w_2, \ldots, w_{k-1}, w'_k)$ is a motif.

As an application of Constraint 1, a motif $p$ of length 1 may be sequentially extended
by any other motif $p_u$ of length 1, which generates the new candidate $p''$. For longer
motifs: knowing that the sequence $p = (w_1, w_2, \ldots, w_{k-1}, w_k)$ is a motif, its
sequential extension $p'' = (w_1, w_2, \ldots, w_k, w_{k+1})$ may only be a motif if
$p_s = (w_2, \ldots, w_{k-1}, w_k, w_{k+1})$ is a motif. When generating candidates, we always
apply all possible sequential and concretizing extensions. This has the advantage
that we always know in advance whether the sequences $p_t$ and $p_s$ are motifs.
Furthermore, the cross-pointers have to be updated as well.

    support_count() {
        for each time series (denoted by s) of input
            support_count_1(candidates.root, s, max_dist);
            if (s did not support any node) delete s from input;
    }

    support_count_1(TrieNode n, TimeSeries s, int allowed_dist_to_next_match) {
        item first_symbol = s.first();
        TailTimeSeries tail_sequence = s.tailTimeSeries();
        TrieNode n1 = sequential child node of n such that
                      the incoming edge to n1 matches first_symbol;
        if (exists n1) && (exists a candidate reachable over n1)
    (*)     N = set of n1 and all of the taxonomic descendants of n1 matched by first_symbol
            for each node (denoted by n0) of N
                if (n0 candidate) && (n0 has not been supported by this input sequence before)
                    n0.incrementSupport();
                support_count_1(n0, tail_sequence, max_dist);
    (*) if check(allowed_dist_to_next_match)
            support_count_1(n, tail_sequence, new_allowed_dist_to_next_match);
    }

Fig. 4 Pseudo-code of one of the main steps of our algorithm: support counting. Lines marked with
an asterisk contain generalizations compared to the case of flat motifs

Support count: The dataset of time series is processed sequentially, one sequence at
a time. Each node of the trie contains a counter. For each time series, the counters
of the matched candidates are incremented.

Matched candidates can be found efficiently using a doubly recursive search
scheme, which is shown in Figs. 3 and 4.

When processing a time series, the counting function is first invoked for the
root. (The current node is the root.) Then, this function is recursively invoked with
the tail of the current time series (1) for the sequential child of the current node
that matches the first item of the time series, (2) for all matching taxonomic
descendants of this matching sequential child, and (3) for the current node as well.
Note that this step is a generalization of the corresponding step in Borgelt (2003).

In contrast to Borgelt (2003), our case requires some additional bookkeeping.
During the doubly recursive search, we have to take into account (and possibly
skip some of the recursion steps because of) (1) the number of "gaps" that
the (d, n) semi-continuous candidate has already had in the input sequence up to the
current position, and (2) (if there is currently a "gap" in the matched time series)
the length of the current "gap" up to the current position. We also have to take care
not to increment the counter of a node twice while processing a time series. (Note
that it is possible to arrive at the same node several times, as a candidate may be
matched by several parts of a time series.) To further increase efficiency, we use a
pruning technique similar to the one described in Borgelt (2004), i.e., we prune those
subtrees that do not contain candidates.



4 Experimental Evaluation

Dataset. The data used in our experiments was collected at the Fresenius Clinics. It
contains recordings of dialysis sessions for 725 patients. The patients have to consult
the doctor regularly for treatment; some data (like blood pressure, body temperature,
...) is recorded every time, which leads to a sequence of observations. We have about
40 time series per patient. Some master data of the patients (like age, sex, body mass
index, ...) are also available. There are two groups of patients: "normal" (53%) and
"risky" (47%). We use the same dataset as in Knorr (2006a, 2006b) and refer to
Knorr (2006a) for a more detailed description.

Experimental settings, motif selection. We discover motifs on the different time series
separately (i.e., separately on the time series of blood pressure, body temperature,
...). The minimum support threshold was set to 0.06 (exp. 1) and 0.05 (exp. 2).
Similar to Knorr (2006b), we select the most predictive motifs for each class: we
choose motifs that predict the "normal" class with a probability of 90% (exp. 1c),
85% (exp. 1a,b), 80% (exp. 2) and motifs that predict the "risky" class with a
probability of 85% (exp. 1c), 80% (exp. 1a,b), 75% (exp. 2). Furthermore, we only
select motifs that are statistically significant for a class ($\chi^2$ test,
$\alpha = 0.05$), and we limit the total number of Apriori iterations to 10 (exp. 1) and
20 (exp. 2) in order to get local motifs. In exp. 1 we additionally require motifs to
have a minimum length of three symbols. Among the motifs fulfilling these criteria, in
exp. 1 we select only five motifs for each of the 40 different kinds of time series in
the following way: first we select the motif $m_0$ that is supported by the most time
series. Then we perform 4 iterations and always select the motif $m_k$
($k = 1, 2, 3, 4$) that is supported by the most of those time series that do not
support any of $m_0, \ldots, m_{k-1}$.
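
The iterative selection of five motifs is a greedy covering procedure and can be
sketched in Python as follows. The sketch assumes that, for every motif passing the
filters above, the set of identifiers of its supporting time series is already known;
the function name `select_motifs` and the tie-breaking rule (first motif in dictionary
order wins) are assumptions.

```python
def select_motifs(supporters, how_many=5):
    """supporters: dict mapping each motif to the set of ids of time series supporting it."""
    selected, covered = [], set()
    remaining = dict(supporters)
    for _ in range(how_many):
        if not remaining:
            break
        # choose the motif supported by most series that support none of the chosen motifs
        best = max(remaining, key=lambda motif: len(remaining[motif] - covered))
        selected.append(best)
        covered |= remaining.pop(best)
    return selected
```

In the first round `covered` is empty, so the motif with the largest support is chosen,
which corresponds to the selection of the first motif described above.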

As central classifiers we use the WEKA implementations of SVMs (with RBF
kernel) and Bayesian Networks. The parameters of the SVMs (complexity constant
and exponent) are each learned over a range of powers of two, using a hold-out
subset of the training data in a five-fold cross-validation protocol. We perform
ten-fold cross-validation (i.e., the full dataset is split into test and training set
ten times). Motifs are discovered on the training set. Then the time series of the
test set are checked for whether they contain the motifs discovered on the training
set. Note that our experimental protocol differs from the one used in Knorr (2006b);
thus our results are not directly comparable.

To calculate the baseline, in exp. 1 we use time series aggregate data (min.,
max., avg.) as input of the central classifier. We extend it with CPM attributes
(CPM = Count of Predictive Motifs for each class). In exp. 2 we use both master
data and time series aggregates (baseline). We extend them with features indicating
the containment of each motif as in Fig. 2 and with CPM attributes as well. We also
compare to an SVM classifier integrating kNN-DTW predictions and master data
(MD).

Results. In exp. 1 we compare different subclasses of semi-continuous generalized
motifs [(1) continuous, (2) semi-continuous with max. 2 gaps of max. length 1,
(3) motifs with taxonomical wildcards using a simple taxonomy like in Fig. 1 but
without *]. The results (Table 2) show that both gaps and taxonomical wildcards
are beneficial for classification accuracy. Exp. 2 (Table 3) shows that motifs are
beneficial in a realistic scenario (where master data is also available) as well.

Table 2 Impact of different generalized semi-continuous motif subclasses (exp. 1)

                     Without   (a) Continuous    (b) (1,2) Semi-cont.   (c) Motifs with tax.
                     motifs    motifs (counts)   motifs (counts)        wildcards (counts)

SVM                  66.52     65.07             68.24                  68.13
SVM (logistic)       65.97     66.28             67.00                  68.39
Bayesian Network     70.11     70.51             71.04                  69.37

Table 3 Impact of motifs if master data (MD) is available (exp. 2)

                         Baseline              Motifs     Motifs
                         (without motifs)      (counts)   (counts + indicators)

SVM                      72.40                 72.12      75.61
SVM (logistic)           72.38                 73.35      76.43
Bayesian Network         73.84                 74.76      74.76
SVM on kNN-DTW + MD      73.7 (Knorr, 2006a)



5 Conclusion 

We introduced a new class of motifs, generalized semi-continuous motifs. We
proposed an efficient algorithm to discover them, and we showed that these motifs
improve the accuracy of time series classification. As future work we would like to
compare different subclasses of generalized semi-continuous motifs in more detail
and deal with parameter learning.



A Novel Approach to Construct Discrete
Support Vector Machine Classifiers



Marco Caserta, Stefan Lessmann, and Stefan Voß



Abstract Discrete support vector machines (DSVM) are recently introduced clas- 
sifiers that might be preferable to the standard support vector machine due to a 
more appropriate modeling of classification errors. However, this advantage comes 
at the cost of an increased computational effort. In particular, DSVM rely upon a 
mixed-integer program, whose optimal solution is prohibitively expensive to obtain. 
Therefore, heuristics are needed to construct respective classifiers. This paper pro- 
poses a novel heuristic incorporating recent advances from the field of integer 
programming and demonstrates its effectiveness by means of empirical experimen- 
tation. Furthermore, the appropriateness of the DSVM formulation is examined 
to shed light on the degree of agreement between the classification aim and its 
implementation in form of a mathematical program. 

Keywords Classification · Meta-heuristics · Mixed-integer programming · Support
vector machines.



1 Introduction 

This paper deals with constructing mathematical models that enable a categoriza- 
tion of objects into a priori known classes. The objects are characterized by a set 
of attributes, which are assumed to affect group membership. However, the precise 
relationship between attributes and class is unknown and has to be estimated from 
a training dataset of labeled examples. A classifier can thus be defined as a func- 
tional mapping from objects x to classes y, which is derived from a training set 
by invoking an induction principle. Classification (i.e., the process of building and
applying classifiers) is a general concept with applications in various fields such as
medical diagnosis (e.g., detecting the presence or absence of a particular disease
based on clinical tests), text classification (e.g., automated categorization of
documents), managerial decision making (e.g., credit approval to loan applicants), and
many more.

Structural risk minimization (SRM) (Vapnik, 1995) represents such an induction 
principle and strives to construct classification models that balance the conflicting 
goals of high accuracy and low complexity. These two objectives can be imple- 
mented in the form of a mathematical program, whose solution gives a classification 
model in the spirit of SRM. The support vector machine (SVM) (Vapnik, 1995) rep- 
resents the most popular implementation of SRM and has proven its potential in 
several benchmarking experiments (see, e.g., Baesens et al., 2003; Van Gestel et al., 
2004). 

Orsenigo & Vercellis (2004) point out that the particular way SVMs model 
misclassifications is approximate and may deteriorate classification performance. 
They argue that an accurate (i.e., discrete) measurement of classification error is 
more aligned with the SRM principle and should thus enable better predictions. 
Following this line of reasoning, the DSVM has been developed, which incorpo- 
rates a step function to account for misclassifications during training (Orsenigo & 
Vercellis, 2004). DSVMs have been extended in subsequent work to, e.g., mini- 
mize the number of employed attributes (Orsenigo & Vercellis, 2003), incorporate 
fuzzy class-membership functions (Orsenigo & Vercellis, 2007a) or soften the mar- 
gin of classification (Orsenigo & Vercellis, 2007b). Highly encouraging results have 
been observed in each of these studies, suggesting that DSVMs are an interesting 
alternative to standard SVMs and other methods. 

Constructing a DSVM classifier requires the solution of a mixed integer program
(MIP), whereby integrality constraints originate from the discrete error measurement
that is key to any DSVM formulation. As a result, classifier training is considerably
more difficult for DSVMs than for SVMs. In particular, the underlying optimization
program cannot be solved to optimality for any reasonably sized dataset and therefore
requires techniques from the field of heuristic search. Whereas tabu search has been
used in earlier work (Lessmann et al., 2006; Orsenigo & Vercellis, 2004), later studies
rely upon a linear programming (LP) based heuristic (LPH), originally developed in
Orsenigo & Vercellis (2003). Given the pivotal importance of efficient and effective
search procedures for DSVMs, it is surprising that little effort has been invested to
improve upon this approach. Consequently, the objective of this paper is to design a
novel heuristic to construct DSVM classifiers that utilizes recent advancements from
the field of MIP. The effectiveness of the novel method is scrutinized within an
empirical study, contrasting DSVM classifiers as produced by the LPH and the new
approach. As a second contribution, the availability of this alternative training
algorithm allows examining the degree of correspondence between the solution of the
underlying optimization program and the predictive accuracy of the resulting
classifier. This aspect has not been considered in previous research, but is of pivotal
importance, not only for the development of novel optimization algorithms, but for
confirming - or challenging - the general appropriateness of a particular mathematical
formulation towards classification.

The paper is organized as follows: The next section introduces the standard SVM
as well as its discrete counterpart, before the novel MIP-based heuristic is described.
Subsequently, empirical experiments are conducted in Sect. 3 to evaluate the new
approach. Conclusions are drawn in Sect. 4.



2 Discrete Support Vector Machines 

2.1 Motivation and Mathematical Formulation 

The original SVM can be described as a supervised learning algorithm that allows
solving linear and nonlinear binary classification problems. Let
$S = \{x_i, y_i\}_{i=1}^{N}$ denote a training set of $N$ examples, whereby
$x_i \in \mathbb{R}^M$ represents an input vector (object) and $y_i \in \{-1; +1\}$ its
corresponding binary class label. SVMs implement the idea of separating examples by
means of a maximal margin hyperplane (Vapnik, 1995). That is, the algorithm strives to
maximize the distance between the objects closest to a linear separation boundary.
This can be achieved by minimizing the norm of the plane's normal $w$, subject to the
constraint that examples of each class reside on opposite sides of the plane; see
Fig. 1. This constraint may be relaxed to account for overlapping class distributions
by introducing a non-negative slack variable $\xi_i$. The resulting formulation is
(Vapnik, 1995):

$$
\begin{aligned}
\min_{w, b, \xi} \quad & \|w\| + \beta \sum_{i=1}^{N} c(i)\, \xi_i \\
\text{s.t.} \quad & y_i (w \cdot x_i + b) \ge 1 - \xi_i \quad \forall i = 1, \ldots, N \\
& \xi_i \ge 0 \quad \forall i = 1, \ldots, N,
\end{aligned}
\tag{1}
$$

where $\|\cdot\|$ denotes the Euclidean norm, $c(i)$ represents the cost of misclassifying
object $i$ and $\beta$ is a tuning parameter that enables the user to control the
trade-off between maximizing the margin and separating examples with few errors (see
Fig. 1). Once the quadratic program (1) has been solved, the classifier can be
expressed in terms of the decision variables $w$ and $b$, leading to the linear decision
function:

$$
f(x) = \mathrm{sign}(w \cdot x + b). \tag{2}
$$

As is apparent from Fig. 1, SVMs incorporate the continuous slack variable $\xi_i$ to
approximate classification error. That is, $\xi_i > 0$ indicates that object $i$ lies
inside the margin (but possibly on the correct side of the plane), whereas true errors
are identified by $\xi_i > 1$. Consequently, both cases are penalized in (1), whereby an
object's distance to its respective supporting plane ($\xi_i$) is used as penalty. Note
that all objects inside or at the wrong side of the margin as well as all objects
located directly on one of the two support planes are called support vectors (SVs).





Fig. 1 Linear SVM classifier in two-dimensional space with maximal margin 



The appropriateness of this approximate error measurement in SVMs is debatable.
For example, the cost of an erroneous decision (e.g., misclassifying a positive object
as negative or vice versa) is commonly assumed to depend on object $i$'s class only.
Consequently, such errors, rather than distances, should be minimized using their
associated cost $c(i)$ as weight. Such considerations motivated Orsenigo & Vercellis
(2004) to develop the DSVM, which accurately counts true errors by means of a
step function. In particular, (1) is modified in two ways: First, the L2-norm of $w$
is replaced by the L1-norm to obtain a linear program. Second, $\xi_i$ is replaced by
a binary indicator variable $\theta_i$ to count misclassifications, leading to the
following program (Orsenigo & Vercellis, 2004):

$$
\begin{aligned}
\min \quad & \sum_{j=1}^{M} u_j + \beta \sum_{i=1}^{N} c(i)\, \theta_i \\
\text{s.t.} \quad & y_i (w \cdot x_i + b) \ge 1 - Q \theta_i \quad \forall i = 1, \ldots, N \\
& -u_j \le w_j \le u_j \quad \forall j = 1, \ldots, M \\
& u_j \ge 0 \quad \forall j = 1, \ldots, M \\
& \theta_i \in \{0; 1\} \quad \forall i = 1, \ldots, N,
\end{aligned}
\tag{3}
$$

whereby $Q$ denotes a sufficiently large number (see Orsenigo & Vercellis, 2007b
for a heuristic to set this parameter), and the $u_j \in \mathbb{R}$ account for the fact
that $-\infty < w_j < +\infty$.
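
The difference between the two error measurements can be illustrated with a short
Python sketch that evaluates both objectives for a fixed hyperplane. It is an
illustration only: no optimization is performed, the trade-off parameter is called
`beta` here, and the function names are hypothetical. The sketch further assumes that
$Q$ is large enough, so that an indicator can be zero exactly when the corresponding
functional margin is at least one.

```python
def margins(w, b, X, y):
    """Functional margins y_i * (w . x_i + b) of all objects."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi, yi in zip(X, y)]


def svm_objective(w, b, X, y, beta, cost):
    """Objective of program (1): Euclidean norm of w plus cost-weighted slack."""
    # smallest slack satisfying the margin constraint for each object
    slack = [max(0.0, 1.0 - m) for m in margins(w, b, X, y)]
    norm2 = sum(wj * wj for wj in w) ** 0.5
    return norm2 + beta * sum(c * s for c, s in zip(cost, slack)), slack


def dsvm_objective(w, b, X, y, beta, cost):
    """Objective of program (3): L1 norm of w plus cost-weighted number of indicators set to one."""
    # smallest feasible indicator: zero if the margin constraint holds without relaxation
    theta = [0 if m >= 1.0 else 1 for m in margins(w, b, X, y)]
    norm1 = sum(abs(wj) for wj in w)
    return norm1 + beta * sum(c * t for c, t in zip(cost, theta)), theta


X = [[2.0], [0.5], [-2.0], [0.3]]
y = [1, 1, -1, -1]
svm_value, slack = svm_objective([1.0], 0.0, X, y, beta=1.0, cost=[1, 1, 1, 1])
dsvm_value, theta = dsvm_objective([1.0], 0.0, X, y, beta=1.0, cost=[1, 1, 1, 1])
```

In this toy example the slack penalty grows with the distance of an object from its
supporting plane, whereas each indicator contributes the same fixed cost regardless of
that distance.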

A disadvantage of the DSVM formulation stems from the fact that it is so far
limited to linear classification, whereas standard SVMs can easily be extended to
solve nonlinear classification tasks by introducing a kernel function into the dual
of (1) (see, e.g., Vapnik, 1995). In contrast, the dual of (3) is difficult to solve
and contains nonlinear constraints. Consequently, this paper is restricted to the
linear case, whereas the development of nonlinear DSVMs is left to future research.

The standard approach to construct a DSVM classifier involves solving a series of
linear programs, which are obtained by relaxing the integrality constraints. That
is, the last constraint in (3) is replaced with
$0 \le \theta_i \le 1 \ \forall i = 1, \ldots, N$. After computing an initial solution, the
algorithm proceeds by fixing the $\theta_i$ with the smallest fractional value to zero and
solving the resulting LP. The variable fixing scheme is continued until an infeasible
solution is encountered. Then, the last fixing is reversed and all remaining fractional
$\theta_i$ are set to one, which necessarily produces an integer feasible solution to (3)
(Orsenigo & Vercellis, 2003).

This LPH represents a greedy approach that focuses on producing an integer fea- 
sible solution, whereas the objective, which should represent the primary quality 
indicator from an optimization point of view, is not used to guide the search. There- 
fore, one may hypothesize that the gap between the optimal solution of (3) and 
those produced by the LPH is large, which, in turn, might deteriorate the classifier’s 
predictive accuracy. 

To remedy this problem, a novel heuristic is proposed that draws inspiration from
Sniedovich & Voß (2006) and Fischetti & Lodi (2003). Sniedovich & Voß (2006)
introduce the basic concept of a corridor around an incumbent feasible solution.
They propose the Corridor Method (CM) as a general framework to tackle combinatorial
optimization problems for which an exact method (i.e., a MIP solver) is at hand.
However, due to the large size of the search space, such an exact method cannot
be employed directly to solve the original problem. Consequently, a corridor, i.e., a
limited portion of the search space, is defined around an incumbent solution and the
exact approach is then applied to this reduced portion of the search space.

In a similar fashion, Fischetti & Lodi (2003) introduce soft fixing schemes,
extending the well-known idea of hard variable fixing schemes. The major difference
between this novel approach and the classical hard variable fixing scheme is
that the latter permanently fixes the value of a variable, while the former only
identifies a set of variables and requires that a percentage of those variables be kept
fixed at their current value, leaving the solver free to choose which variables
will actually be fixed to a specific value. Consequently, the soft fixing strategy can
be interpreted as a method for defining a neighborhood, or a corridor, around the
incumbent solution, in the sense that only points "not too far" from the incumbent
will be considered. Given an incumbent feasible solution $x' \in \{0, 1\}^n$, a soft fixing
scheme can easily be defined through the application of the following constraint:

$$
\sum_{j=1}^{n} (1 - x'_j)(1 - x_j) \ge \left\lceil p \sum_{j=1}^{n} (1 - x'_j) \right\rceil,
$$

where $p \in (0, 1)$ is used to calibrate the tightness of the fixing scheme or the width
of the corridor around the incumbent. Obviously, values of $p$ close to one imply a
tighter fixing scheme and a narrower corridor around the incumbent.
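
The effect of the soft fixing constraint can be checked with a few lines of Python.
The sketch only tests whether a given binary vector lies inside the corridor around an
incumbent; it does not call a MIP solver, and the function name `in_corridor` is
hypothetical.

```python
import math


def in_corridor(x, incumbent, p):
    """Check the soft fixing constraint for binary vectors x and incumbent."""
    # number of variables that are zero in the incumbent and stay zero in x
    zeros_kept = sum((1 - xj_inc) * (1 - xj) for xj_inc, xj in zip(incumbent, x))
    # number of variables that are zero in the incumbent
    zeros_in_incumbent = sum(1 - xj_inc for xj_inc in incumbent)
    return zeros_kept >= math.ceil(p * zeros_in_incumbent)
```

For the incumbent `[0, 0, 0, 0, 1, 1]` and `p = 0.5`, at least two of the four zero
entries must remain zero, so `[0, 0, 1, 1, 0, 1]` is inside the corridor while
`[0, 1, 1, 1, 1, 1]` is not. Note that the constraint only restricts variables that are
zero in the incumbent; variables that are one in the incumbent remain free.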
In the following, let us indicate with $y = (w, \theta)$ a feasible solution to
program (3), with $w \in \mathbb{R}^M$ and $\theta \in \mathbb{B}^N$, and let $Z(y)$ be the
corresponding objective function value. Furthermore, let b&c() be a MIP branch and cut
solver used to solve program (3) to optimality. More specifically, we use the COIN-OR
Cbc module of the Computational INfrastructure for Operations Research
library (Lougee-Heimer, 2003). In addition, let t_cycle and t indicate the maximum
computational time allotted to each iteration of the heuristic scheme and to the
overall algorithm, respectively. Finally, let us indicate with max_s the maximum
number of feasible solutions to be found within each iteration of the algorithm.
We iteratively call the MIP solver upon different portions of the search space.
Each iteration of the algorithm terminates whenever either the maximum number
of feasible solutions max_s or the maximum computational time t_cycle has
been reached. Finally, the heuristic terminates when either no improving solution is
found in the neighborhood or the maximum allotted time t is reached. Algorithm
Heuristic_DSVM() provides a description of the basic steps of the proposed
heuristic.



Algorithm 1: Heuristic_DSVM()

Require: max_s, t_cycle, t
Ensure: feasible MIP solution y* (if exists)

 1: k ← 1; p ← 0.5 {initialization}
 2: run b&c() until an initial feasible solution $y^0 = (w^0, \theta^0)$ is found
 3: $Z^* \leftarrow Z(y^0)$; $y^* \leftarrow y^0$ {update best solution}
 4: while stopping criteria not reached do
 5:   add cut C1: $\sum_{j=1}^{N} (1 - \theta_j^{k-1})(1 - \theta_j) \ge \lceil p \sum_{j=1}^{N} (1 - \theta_j^{k-1}) \rceil$ {neighborhood around incumbent}
 6:   add cut C2: $\|w\| + C \sum_{i=1}^{N} \theta_i \le Z^*$ {cut out non-improving solutions}
 7:   run b&c($y^{k-1}$, max_s, t_cycle) {k-th iteration}
 8:   if no feasible solution $y^k = (w^k, \theta^k)$ is found then
 9:     redefine cut C1 with a relaxed right-hand side {enhance corridor}
10:     run b&c($y^{k-1}$, max_s, 1.5 t_cycle) {repeat k-th iteration}
11:     if no feasible solution is found then
12:       STOP {heuristic terminates}
13:     end if
14:   else
15:     $y^* \leftarrow y^k$; $Z^* \leftarrow Z(y^k)$ {update best solution}
16:   end if
17:   remove cuts C1 and C2 {restore original model}
18:   k ← k + 1 {update iteration counter}
19: end while
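
The corridor idea behind cut C1 can be illustrated with a small sketch. It assumes that C1 has the soft-fixing form $\sum_j (1 - \theta_j^{k-1})(1 - \theta_j) \ge \lceil p \sum_j (1 - \theta_j^{k-1}) \rceil$; the function name `in_corridor` and the toy vectors are hypothetical and only serve the illustration. The heuristic itself adds the cut to the MIP model rather than testing candidates one by one.

```python
import math


def in_corridor(theta_prev, theta, p=0.5):
    """Return True if `theta` satisfies the assumed soft-fixing cut C1.

    Both arguments are 0/1 sequences of equal length. The cut requires that at
    least ceil(p * number of zeros in theta_prev) of the entries that are 0 in
    the incumbent are still 0 in the candidate.
    """
    if len(theta_prev) != len(theta):
        raise ValueError("vectors must have the same length")
    zeros_prev = sum(1 - t for t in theta_prev)
    zeros_kept = sum((1 - tp) * (1 - t) for tp, t in zip(theta_prev, theta))
    return zeros_kept >= math.ceil(p * zeros_prev)


# Hypothetical incumbent with four zero entries.
incumbent = [0, 0, 0, 0, 1]
print(in_corridor(incumbent, [0, 0, 1, 1, 1]))  # candidate keeps 2 of the 4 zeros
print(in_corridor(incumbent, [0, 1, 1, 1, 0]))  # candidate keeps 1 of the 4 zeros
```

Raising p towards one shrinks the set of candidates that pass the check, which matches the remark that values of p close to one give a narrower corridor.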
3 Empirical Evaluation 

The empirical study employs nine benchmarking datasets from the UCI Machine 
Learning Library (Newman et al., 1998) that have been considered in previous work 
on DSVMs (e.g., Lessmann et al., 2006; Orsenigo & Vercellis, 2003; Orsenigo & 
Vercellis, 2004). The characteristics of each dataset are summarized in Table 1. 

Using these datasets, we strive to examine the effectiveness of the proposed MIP- 
based heuristic to construct DSVM classification models. However, it is interesting 
to reason about the meaning of effective in this context. Whereas classification anal- 
ysis aims at building models that predict accurately into the future, heuristics like the 
one proposed here are developed to solve complex optimization problems. Therefore, 
we begin by comparing the LPH of Orsenigo & Vercellis (2003) and the 
novel procedure in terms of their ability to minimize (3) and report the respective results 
in Table 1. All figures represent averages, which are obtained by means of ten-fold 
cross-validation on each dataset. 

In Table 1, Z denotes the (average) objective value and contrasts represent the 
percentage improvement/decline of the MIP heuristic over the LPH for the respective 
figure and are computed as $(\pi_{LP} - \pi_{MIP}) / \pi_{LP}$, whereby $\pi$ is either $Z$ or 
$|\{\theta_i : \theta_i = 1,\ i = 1, \ldots, N\}|$. In view of the fact that the resulting classifier is 
ultimately defined by its normal w, the cosine of the angle between the two normal 
vectors as produced by the LPH and the MIP heuristic is used as a measure of sim- 
ilarity for the separating hyperplanes (i.e., values close to one indicate similarity), 
whereby w indicates that a vector is normalized. 
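
As a minimal sketch of the two quantities just described, the following Python fragment computes a percentage contrast and the cosine of the angle between two normal vectors. The function names and all numbers are hypothetical and are not taken from Table 1.

```python
import math


def contrast_percent(value_lp, value_mip):
    """Contrast (pi_LP - pi_MIP) / pi_LP, expressed in percent."""
    return 100.0 * (value_lp - value_mip) / value_lp


def cosine_similarity(w_a, w_b):
    """Cosine of the angle between two normal vectors of equal length."""
    dot = sum(a * b for a, b in zip(w_a, w_b))
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    return dot / (norm_a * norm_b)


# Hypothetical objective values and hyperplane normals.
print(contrast_percent(10.0, 4.0))
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))
```

A positive contrast means that the MIP heuristic reached the lower value, and a cosine close to one means that both heuristics produced nearly parallel hyperplanes.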

Table 1 confirms the effectiveness of the MIP-based heuristic with Z being consistently 
lower on all but one dataset. The statistical significance of this improvement 
is confirmed by means of a Wilcoxon signed rank test (p-value 0.0391) and the 
advantage can be as large as 60%. These results confirm the impression that the LP- 
based approach is less suitable from an optimization perspective and leaves room for 
improvements. For example, the MIP heuristic produces a significantly less complex 
plane (larger margin) in the sonar case or yields solutions with substantially fewer 
SVs (e.g., house and ionosphere). 

Having confirmed the suitability of the proposed heuristic to solve (3), we may 
proceed with comparing the predictive accuracy of the classification models as produced 
by the MIP-based and the LP-based approach. In particular, the area under a 
receiver-operating characteristics curve (AUC) as well as classification error (CR) 
are used as performance indicators. The former represents a general measure of 
predictiveness (Fawcett, 2006), whereas CR is considered because a minimization 
of classification errors originally inspired the development of DSVMs (Orsenigo & 
Vercellis, 2004). 

Comparative results are given in Table 2, whereby figures in parentheses represent 
the standard deviation of the respective measurement during cross-validation 
and contrasts are computed as above. Intriguingly, Table 2 reveals that the appealing 
results of the MIP-based heuristic do not translate into better classifiers. Although 
the novel method produces classifiers with lower (better) CR in some cases, a 
Wilcoxon signed rank test indicates that the two classifiers do not differ significantly 
Table 2 Predictive performance of DSVM classifiers constructed by different solvers 

              LP-based heuristic          MIP-based heuristic         Contrasts in percent
Dataset       AUC           CR            AUC           CR            AUC       CR
Ac            0.92 (0.01)   0.07 (0.01)   0.91 (0.03)   0.07 (0.02)    1.57      4.98
Gc            0.86 (0.03)   0.10 (0.02)   0.85 (0.04)   0.10 (0.02)    0.15      1.53
Heart         0.89 (0.04)   0.10 (0.03)   0.89 (0.06)   0.10 (0.03)    0.68     -7.72
House         0.98 (0.02)   0.02 (0.01)   0.99 (0.02)   0.02 (0.01)   -0.46     10.61
Ionosphere    0.85 (0.08)   0.07 (0.03)   0.88 (0.07)   0.07 (0.03)   -4.15      5.34
Liver         0.71 (0.06)   0.17 (0.04)   0.67 (0.09)   0.20 (0.03)    6.13    -18.85
Pima          0.83 (0.06)   0.12 (0.02)   0.82 (0.06)   0.12 (0.03)    0.33      1.04
Sonar         0.79 (0.11)   0.14 (0.05)   0.81 (0.07)   0.13 (0.04)   -2.75     12.04
Wbc           1.00 (0.01)   0.02 (0.01)   1.00 (0.01)   0.01 (0.01)   -0.06     11.76
Mean          0.87 (0.05)   0.09 (0.02)   0.87 (0.05)   0.09 (0.02)    0.16      2.30



(p-value 0.4961). This pattern is even more explicit when considering the AUC 
results. Here, it is the LPH that produces better classifiers (higher AUC) in five out of 
nine cases. However, the differences are once more insignificant. Consequently, the 
performance comparison suggests that the novel MIP-based heuristic does not construct 
more accurate classifiers, although it is demonstrably better from an optimization 
perspective. This view is further supported by examining the correlation between 
the Z-contrast (Table 1) and the contrasts of classifier performance (Table 2). Given 
that Demšar (2006) casts doubt on the appropriateness of parametric statistics for 
classifier comparisons, Kendall's $\tau$ (see, e.g., Zar, 2007) seems a preferable indicator 
to capture correlation (or the lack thereof). Values of 1 (-1) for $\tau$ indicate perfect 
correspondence (disagreement) between, e.g., AUC and Z-contrasts, whereas values 
around zero indicate independence. In the present case, no significant correlation 
can be detected between AUC and Z-contrasts ($\tau = -0.44$ [p-value 0.12]) or CR 
and Z-contrasts ($\tau = 0.39$ [p-value 0.18]). This confirms that there is no apparent 
relationship between the quality of the solution to (3) and the predictive performance 
of the resulting classifier. 
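
For readers who want to reproduce this kind of check on their own contrasts, the following sketch computes Kendall's rank correlation in its simplest form (tau-a, without tie correction and without a significance test). Which variant the study used is not stated, so this is an assumption, and the sample numbers are hypothetical rather than taken from Tables 1 and 2.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(list(zip(x, y)), 2):
        sign = (x_i - x_j) * (y_i - y_j)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical contrasts for four datasets.
z_contrasts = [12.0, 3.5, 40.1, -2.0]
auc_contrasts = [0.4, -1.2, 0.9, 2.3]
print(kendall_tau(z_contrasts, auc_contrasts))
```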

When assessing the previous finding, it is important to consider that the solution 
to (3) is obtained from the training dataset, whereas predictive performance is mea- 
sured on hold-out data. Clearly, one would not expect a perfect correlation between 
in-sample and hold-out sample results. However, the absence of any significant asso- 
ciation is counterintuitive, especially since DSVMs incorporate the ideas of SRM. 
One may speculate that modifications of the DSVM formulation (3) could help to 
increase the alignment between the mathematical program guiding classifier con- 
struction and the actual objective of classification (i.e., predictive accuracy). In fact, 
Orsenigo and Vercellis have already proposed enhancements over the formulation 
considered here (see, e.g., Orsenigo & Vercellis, 2003, 2007a,b) and we plan to 
extend the previous analysis to these alternatives in future research. Furthermore, it 
seems advisable to regard the preceding correlation analysis as a statistical tool 
that should generally be applied when crafting new classifiers grounded in 
mathematical optimization, both to appraise their appropriateness from a different 
angle and to augment standard measures of predictive performance. 
4 Conclusions 

We have developed a novel MIP-based heuristic to construct DSVM classifica- 
tion models that utilizes a standard MIP solver and integrates the concepts of a 
corridor around incumbent feasible solutions and soft fixing. The effectiveness of 
the proposed approach has been confirmed by means of empirical experimenta- 
tion, achieving significant improvements over the previous standard. Furthermore, 
a correlation analysis has been undertaken to evaluate the DSVM classifier from 
an optimization perspective. It has been shown that the correspondence between 
the goal of classification and its respective operationalization in the form of an objective 
function is not yet satisfactory for the DSVM formulation considered in this 
work. That is, better solutions to the mathematical program underlying DSVMs do 
not translate into better classifiers. This highlights the potential to design and assess 
novel formulations in future research. Such models could address the classical clas- 
sification setting, or generalize to complex decision making scenarios, possibly 
including different types of prior information or application-specific constraints. 
DSVMs could be of particular value in such complex settings, thanks to the flex- 
ibility and expressive power of integer programming, which has always been at the 
core of this classifier. 
Predictive Classification Trees 



Stephan Dlugosz and Ulrich Müller-Funk 



Abstract CART (Breiman et al., Classification and Regression Trees, Chapman 
and Hall, New York, 1984) and (exhaustive) CHAID (Kass, Appl Stat 29:119-127, 
1980) figure prominently among the procedures actually used in data-based management, 
etc. CART is a well-established procedure that produces binary trees. CHAID, 
in contrast, admits multiple splittings, a feature that allows the splitting 
variable to be exploited more extensively. On the other hand, that procedure depends 
on premises that are questionable in practical applications. This can be put down to 
the fact that CHAID relies on simultaneous Chi-Square tests and F-tests, respectively. 
The null distribution of the second test statistic, for instance, relies on the normality 
assumption, which is not plausible in a data mining context. Moreover, none of these 
procedures - as implemented in SPSS, for instance - takes ordinal dependent variables into account. 

In the paper we suggest an alternative tree-algorithm that: 

• Requires explanatory categorical variables 

• Chooses splitting attributes by means of predictive measures of association. The 
cells to be united - respectively the number of splits - are determined with the 
help of their conditional predictive power 

• Greedily searches for a part of the population that can be classified/scored rather 
precisely 

• Takes ordinal dependent variables into consideration 

Keywords Booty trees · Factor reduction · Ordinal measure of dispersion · Predictive 
measure of association. 



1 Statement of the Problem 

Trees are popular among users of statistical methodology as they: 

• Are non-linear and require non-parametric model assumptions only 

• Include factor selection in an integrated way 
• Provide precise forecasts - additionally possessing explanatory power, in contrast to black box procedures like ANN or SVM 

• Yield findings that are easy to communicate 

The last two statements, however, only hold true for trees that on the one hand 
do not grow fast and on the other hand result in leaves representing a sufficiently 
large and homogeneous subpopulation. The last trait, of course, is essential to secure 
the accuracy of the final decision. These requirements imply conflicting aspects. In 
situations where a comparatively large number of interacting factors from various 
domains is involved, some sort of factor preselection seems to be unavoidable in 
order to find a trade-off. Beyond that, the performance of every statistical procedure 
suffers if it has to cope with many factors that merely express noise (Hastie et al., 2001). 

Tree induction is facilitated if it is only required that an adequate percentage of 
leaves are good natured in the sense stated above. For some applications a procedure 
designed for that purpose might be more useful. In order to illustrate this point, 
imagine a mailing action to foster a new product. For the purpose of such a campaign 
not all of the addresses available have to be classified precisely. All that is needed is 
a “booty-algorithm” that singles out a sufficient number of prospective addresses to 
be included into that promotion. A tree grown with that feature in mind will exhibit 
numerous “randomization-leaves”, at which a clear-cut decision is neither advisable 
nor needed. 

In the next section we describe a device for factor reduction. The 
following sections deal with splitting measures and tree-induction, respectively. The 
discussion is restricted to classification, but - with some modifications - carries over 
to the regression case. 



2 Factor Selection 



With trees, all explanatory factors are finally rendered categorical during tree- 
construction. We think it advisable to make discretization part of data preparation. 
Thereby, all factors can be brought on a comparable scale allowing for just one split- 
ting measure. Moreover, the maximal number of categories (and splits) is limited. 
Thereby, a bias towards quantitative variables can be avoided and noise in the fac- 
tors is diminished. Technically speaking, that categorization can be carried out w.r.t. 
the target variable with the help of predictive measures. 

Factor selection typically involves two kinds of activities: 

• Ranking of factors resp. factor combinations 

• Factor reduction: elimination, prototype and outlier clustering of factors 

Attribute clustering is applied to mitigate the drawback of a factor-by-factor analysis, 
i.e., the complete neglect of possible interactions. It is hoped, of course, that 
factors from different groups are approximately “orthogonal”, whereas factors in 
one group have comparable impact on the target value and that one (or a few) of 
them might represent the cluster sufficiently well. 
2.1 Factor Reduction 

In order to handle factor reduction, we have to compare the impact of two (sets of) 
explanatory factors on the target variable. The impact of factor X will be measured 
in terms of the (empirical) categorical regression function' 



For purely nominal, nondichotomous variates it seems to be advisable to choose a 
different metric or to work with the conditional probabilities themselves to prevent 
a dependency on the coding. 

Based on that pseudo-metric, a standard clustering algorithm can be employed in 
order to group factors resp. to identify “outliers”. Each cluster can be represented, 
for instance by that factor(-combination) that shows the highest predictive power 
w.r.t. the target variable (cf. Sect. 3). 



2.2 Example 

The following illustration of the factor clustering principle introduced is based 
on the credit data from Fahrmeir, Hamerle, and Tutz (1996). This small data set 
describes the status of 1,000 former debtors (binary: credit repaid or not) compris- 
ing twenty factors concerning the users or the credits. The dendrogram shown in 
Fig. 1 results from a hierarchical cluster analysis with complete linkage based on 
the distance matrix for these factors that has been calculated with metric (1). 

The factors two to six, i.e., “credit period” (months), “payment behavior”, 
“intended use”, “amount”, “savings account and other securities” are separated from 
the other factors (like “gender”, “family background”, “employer” etc.). This shows 
that a factor selection is recommendable. Otherwise, the less important factors may 
confuse and destabilize tree growing. These five factors are sufficient. Wald-Tests 
following a logistic regression show that the factors “running account”, “credit 
period”, “payment behavior”, and “intended use” are significant. The results match 
very well - except for the first factor that has been replaced by factor “savings 
account and other securities” (cf. Fahrmeir et al., 1996). 



* # denotes the cardinality. 



$r_X(k) = \mathrm{Mode}\big(\pi_X(\cdot \mid k)\big)$ 




The difference in impact of two factors V, W is 




( 1 ) 



k.l 
[Figure: dendrogram of the twenty credit factors (hierarchical clustering, complete linkage). Horizontal axis: Rescaled Distance Cluster Combine, 0 to 25. Leaf order by factor number: 19, 20, 1, 17, 18, 15, 16, 13, 14, 11, 12, 9, 10, 7, 8, 6, 5, 3, 2, 4.]

Fig. 1 Dendrogram 



3 Predictive Measures of Association 

We consider measures of predictive association that mimic and take the form 

$$A_{L,D}(Y \mid X) = 1 - \frac{L\big(D(\mathcal{L}(Y \mid X))\big)}{D(\mathcal{L}(Y))}. \qquad (2)$$

Here, L is a measure of location and D a measure of dispersion. (L, D) have to 
be chosen in a way that ensures $0 \le A_{L,D} \le 1$. Typically, L is taken to be one of the 
standard functionals. The choice of D is more delicate and, of course, depends on 
the scaling (Stevens, 1946). 

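Note 1. If $I \subset \mathbb{R}^m$, resp. $J \subset \mathbb{R}^n$ denote the (Borel) domains of Y and X, then 
$A_{L,D}$ is defined on $\mathcal{M} \times \mathcal{P}$, where $\mathcal{M}$ is a class of Markov kernels from J to I 
including $\mathcal{L}(Y \mid X)$ and $\mathcal{P}$ is a model class of probabilities on J for $\mathcal{L}(X)$. In what 
follows, $I \subset \mathbb{R}^m$ and $J \subset \mathbb{R}^n$ are discrete. 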

A set of axioms for measures of dispersion for categorical variables has been 
published recently in Müller-Funk (2007) and is extended here to ordinal variates: 

Definition 1 (Ordinal measure of dispersion). Let $D_o$ be a functional $D_o : \mathcal{P} \to [0, \infty[$, 
where $\mathcal{P}$ denotes the class of all finite stochastic vectors, i.e., $\mathcal{P}$ is the union 
of the sets $\mathcal{P}_K$ comprising all probability vectors of length $K \ge 2$. If $D_o$ satisfies 

(PI)* $D_o(p_K, \ldots, p_1) = D_o(p_1, \ldots, p_K)$ for all $p = (p_1, \ldots, p_K) \in \mathcal{P}_K$, 

(MD) $D_o(p) = 0$ iff $p$ is a unit vector, 

(MA) With $x_{\mathrm{med}} := \inf\{x \in \mathbb{R} : F(x) \ge \tfrac{1}{2}\}$: a stochastic ordering of the conditional 
distributions given $X > x_{\mathrm{med}}$ and given $X \le x_{\mathrm{med}}$ under $p$ and $q$ implies the 
corresponding ordering of $D_o(p)$ and $D_o(q)$ for all $p, q \in \mathcal{P}_K$ (cf. Witting & Müller-Funk, 
1985, p. 213 for the definition of the stochastic ordering), 

(SC) $D_o(p_1, \ldots, p_{k-1}, r, s, p_{k+1}, \ldots, p_K) \ge D_o(p_1, \ldots, p_{k-1}, p_k, p_{k+1}, \ldots, p_K)$, 
where $p \in \mathcal{P}_K$, $0 < r, s$ and $r + s = p_k$, 

(MP) $D_o((1 - r)p + rq) \ge (1 - r) D_o(p) + r D_o(q)$ for $0 < r < 1$ and $p, q \in \mathcal{P}_K$, 
and additionally $D_o$ is continuous on all of $\mathcal{P}_K$, $K \ge 2$, 

(EC) $D_o(p_1, \ldots, p_K, 0) \le D_o(p_1, \ldots, p_K)$, 

$D_o(0, p_1, \ldots, p_K) \le D_o(p_1, \ldots, p_K)$ and 

$D_o(p_1, \ldots, p_s, 0, p_{s+1}, \ldots, p_K) \ge D_o(p_1, \ldots, p_K)$ for all $p \in \mathcal{P}_K$, 

it is called an ordinal measure of dispersion. Leaving out axiom (PI), it is called a 
grouped ordinal measure of dispersion. 

Note 2. Axiom (PI) is asterisked because the implied symmetry is not essential for 
a measure of dispersion. The coefficient of variation, for instance, is defined for positive 
values only. Dropping the symmetry requirement is particularly useful for survival analysis and other time-dependent data. 

Example 1. A well-known example of a (metric-)ordinal measure of dispersion is 
the interquantile range $\inf\{x \in \mathbb{R} : F(x) \ge 1 - q\} - \inf\{x \in \mathbb{R} : F(x) \ge q\}$ with 
$q \in [0, 1]$. Unfortunately, it does not satisfy axiom (MP) (choose p = (0.5, 0, 0.5), 
q = (0, 1, 0) and r = 0.8 with the interquartile range), but all other axioms are 
satisfied. This is why this measure is not suitable for growing classification trees. 
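
The counterexample in Example 1 can be checked numerically. The sketch below assumes that the three categories are coded 1, 2, 3 and that the interquartile range uses the level 0.25; the helper names are hypothetical.

```python
def quantile(p, level):
    """Smallest category x (coded 1..K) with F(x) >= level."""
    cumulative = 0.0
    for category, mass in enumerate(p, start=1):
        cumulative += mass
        if cumulative >= level:
            return category
    return len(p)


def interquantile_range(p, q=0.25):
    """Interquantile range inf{x: F(x) >= 1 - q} - inf{x: F(x) >= q}."""
    return quantile(p, 1.0 - q) - quantile(p, q)


def mixture(p, q_vec, r):
    """Convex combination (1 - r) * p + r * q_vec of two probability vectors."""
    return [(1.0 - r) * a + r * b for a, b in zip(p, q_vec)]


p = [0.5, 0.0, 0.5]
q_vec = [0.0, 1.0, 0.0]
r = 0.8

left = interquantile_range(mixture(p, q_vec, r))
right = (1.0 - r) * interquantile_range(p) + r * interquantile_range(q_vec)
print(left, right)  # axiom (MP) would require left >= right
```

The mixture (0.1, 0.8, 0.1) has interquartile range 0, whereas the weighted average of the two individual ranges is about 0.4, so the concavity inequality of (MP) fails.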

The general form of an ordinal measure of dispersion can be based on the 
following principle: 

$$D_o(p) = L\big[D_n(S(1, p)), \ldots, D_n(S(K-1, p))\big],$$

where L denotes an arbitrary measure of location and $D_n$ an arbitrary nominal measure 
of dispersion. Furthermore, we need a splitting function $S(r, p) = (S_1, S_2)$ 
that produces $K - 1$ "splits" of the probability vector p for an ordinal variate. This 
function reflects the kind of ordinality included within the variate. 

The following splitting functions are suggested, which have been inspired by 
various authors contributing to ordinal regression: 

Example 2. There are three possible realizations of the splitting function S: 

• "Splits" (cf. McCullagh, 1980) 

• "Moving Window" (cf. Agresti, 1984) 

$$S(r, p) = \tfrac{1}{2}\,(1 + p_r - p_{r+1},\ 1 - p_r + p_{r+1}).$$

• "Split-conditional Window" (cf. Cox, 1988) 

$$S(r, p) = \ldots$$


The simplest and most intuitive ordinal measure of dispersion based on splittings 
is given by 

$$D_o(p) = \sum_{r=1}^{K-1} D_n\Big(\sum_{i=1}^{r} p_i,\ \sum_{i=r+1}^{K} p_i\Big).$$
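
A minimal sketch of such a split-based measure follows. It rests on two assumptions that are not spelled out legibly in the text: the split after category r is taken to be the cumulative pair $(\sum_{i \le r} p_i, \sum_{i > r} p_i)$, and the nominal measure $D_n$ is taken to be $D_g$ with $g(t) = t(1 - t)$ (the Gini index). All function names are hypothetical.

```python
def gini(q):
    """Nominal dispersion D_g(q) = sum of g(q_i) with g(t) = t * (1 - t)."""
    return sum(t * (1.0 - t) for t in q)


def cumulative_split(r, p):
    """Assumed split after category r (1-based): (P(X <= r), P(X > r))."""
    return (sum(p[:r]), sum(p[r:]))


def ordinal_dispersion(p, nominal=gini):
    """Sum of the nominal dispersions of the K - 1 cumulative splits of p."""
    return sum(nominal(cumulative_split(r, p)) for r in range(1, len(p)))


far_apart = [0.5, 0.0, 0.5]   # mass on the two extreme categories
adjacent = [0.5, 0.5, 0.0]    # same masses on neighbouring categories

print(gini(far_apart), gini(adjacent))
print(ordinal_dispersion(far_apart), ordinal_dispersion(adjacent))
```

Under these assumptions the nominal Gini index cannot tell the two vectors apart, while the split-based measure assigns the larger value to the vector whose mass sits on the extreme categories, which is the ordinal behaviour one would want.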

For this ordinal measure, we can state a consistency theorem: 

Proposition 1. Let p be an interior point of $\mathcal{P}_K$ and $D_o$ an ordinal measure of 
dispersion based on splittings. Further let $D_n = D_g$ with $D_g(p) = \sum_i g(p_i)$, 
where g is a continuous and concave function on [0, 1], g(0) = g(1) = 0 
and g(t) > 0 for 0 < t < 1 (cf. Müller-Funk, 2007 for details and examples), 
and let $u_K = K^{-1}(1, \ldots, 1)$. 



with 



C (^{Do(pn) - Do(p))) ^ ^ (E j 

C ( Wr ) = I ^ ^ 

( C -I- P = ‘^k. 



where 

$$\sigma^2_{g,r} = \big[g'(S_1(r, p)),\ g'(S_2(r, p))\big]\ \Sigma\ \big[g'(S_1(r, p)),\ g'(S_2(r, p))\big]^{T}$$

and $T_{1,1}, \ldots, T_{K-1,2}$ is a sample of standard-normal variates and where 
$\lambda_{1,1}, \ldots, \lambda_{K-1,2}$ denote the eigenvalues of $\Sigma^{1/2} H_r \Sigma^{1/2}$, 

$$H_r = \mathrm{diag}\big[g''(S_1(r, p)),\ g''(S_2(r, p))\big].$$

Proof. Follows directly from Proposition 3 in Müller-Funk (2007, p. 6). 



4 Tree Induction 

The well-known tree induction for CART is based on balancing the two criteria 
“node purity” and “node size” (Breiman, Friedman, Olshen, & Stone, 1984): 



$$\Delta I := I(p(t)) - f(t_L)\, I(p(t_L)) - f(t_R)\, I(p(t_R)).$$
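
As an illustration of this criterion, the sketch below evaluates the impurity decrease of a binary split. It assumes that I is the Gini index and that $f(t_L)$ and $f(t_R)$ are the fractions of the parent's cases sent to the left and right child; both are common CART choices but are assumptions here, and the class counts are hypothetical.

```python
def gini_impurity(counts):
    """Gini index 1 - sum of squared class proportions for a node."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Delta I = I(parent) - f_L * I(left) - f_R * I(right)."""
    n = sum(parent)
    f_left = sum(left) / n
    f_right = sum(right) / n
    return (gini_impurity(parent)
            - f_left * gini_impurity(left)
            - f_right * gini_impurity(right))


# Class counts per node: a perfect split and an uninformative split.
print(impurity_decrease([5, 5], [5, 0], [0, 5]))
print(impurity_decrease([5, 5], [3, 3], [2, 2]))
```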




In contrast to this impartial view, we would like to suggest the more greedy “booty 
algorithm” approach. It is based on the idea of chopping off homogeneous subsam- 
ples in order to pick out relatively large groups with simple discriminating criteria 
as long as possible. To avoid atomization - which would generate absolutely pure 
nodes - we have to ensure a minimal node size that should be determined by sta- 
tistical considerations. The generated “booty tree” will exhibit two kinds of leaf 
nodes: Decisive nodes that are the “booty”, and randomization nodes that contain 
the “offal”. 

A pseudo-code for this algorithm is: 

For each node t: 

1. Select split factor $x_t$ for node t: $A_D(y \mid x_t) = \max$ 

2. (a) Select first split category $i^* = \arg\max_i A_D(y \mid x_t = i)$ 

       Treat remaining categories initially as 

       • One bin if $x_t$ is nominal 

       • Possibly two bins if $x_t$ is ordinal (left/right) 

   (b) For each "offal" bin determine the optimal factor $x_s$, $s \ne t$, 
       and its best category $j^*$. Split 

       • At $j^*$, if $A_D(y \mid x_s = j^*) > A_D(y \mid x_t = i_2^*)$ → tree grows vertically 

       • At $i_2^*$ else → tree grows horizontally 

       where $i_2^*$ denotes the second best category of $x_t$ 

3. Termination: MIS < MIS_max or node size < N_min; 
   decision: majority rule or randomization probabilities 

Step 2(b) incorporates the idea of splitting a node multiple times (cf. Biggs, 
De Ville, & Suen, 1991) and is illustrated in Fig. 2. 

Fig. 2 Booty tree induction with multiple splitting 

An open problem of this tree induction algorithm is the handling of very large 
databases, which could be addressed either with a sequential algorithm (Zhang, 
Ramakrishnan, & Livny, 1996) or - more appealing - by randomly sampling a 
subset that is used for tree induction. 

Another problem - concerning all tree induction methods - is tree stability. The 
usually provided technique for variance-reduction of tree-based estimators is a kind 
of bagging and boosting (forests) (Breiman, 1996, 1998, 2001). Unfortunately using 
these techniques, the easy-to-explain capability of trees gets lost. In order to main- 
tain this feature another approach is needed, which is sketched as follows: Stability 
of the tree as a whole is not necessary, but stability of the feature space partition. 
Therefore, as a first approach, we can build up superpositions of pavings within the 
attribute space and “reconstruct” an appropriate tree afterwards. To reduce the size 
of the tree, we have to reduce the fragmentation of leaves by re-uniting neighboring 
fragments. 

For lack of space, we have not been able to introduce an empirical study into this 
paper. 
Abstract For the structure analysis of non-metric data, it is natural to classify 
objects according to the properties they possess. An effective model to analyze the 
structure of similarities between objects is the random intersection graph generated 
by the random bipartite graph with bipartition (V, W), where V is a set of objects, W 
is a set of properties, and according to some random procedure, edges join objects 
with their properties. In the related random intersection graph two vertices are joined 
by an edge if and only if they represent objects sharing at least s properties. In this 
paper we study the number of isolated vertices and its convergence to Poisson dis- 
tribution. We generalize previous results obtained for special cases of the random 
model and for s = 1 only. Our approach leads us also to some interesting results on 
dependencies between the appearances of edges in the random intersection graph. 

Keywords Isolated vertices · Non-metric data · Random intersection graph. 



1 Introduction 

In applications like the analysis of non-metric data, it is a natural approach to clas- 
sify objects according to the properties they possess. Relations between objects and 
their properties can be described by a bipartite graph with the bipartition (V, W) of 
the vertex set $V \cup W$, where the n-element subset V represents the objects and the 
m-element subset W represents the properties. In such a bipartite graph, edges then 
connect objects (from V) with their properties (from W). Two objects are called 
“similar”, if they share at least s properties for a given positive integer s. A useful 
concept to describe connections between similar objects is that of an intersection 
graph generated by the bipartite graph of the original relations between objects and 
properties. The intersection graph would have the vertex set V of objects where two 
objects are joined by an edge if and only if these two objects share at least s properties. 
In the context of cluster analysis, certain subgraphs of the intersection graph 
on V, for example the connected components, correspond to object clusters. 
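
The construction just described can be sketched in a few lines of Python. The objects, their property sets and the function names are hypothetical; the code only illustrates the rule that two objects are joined if they share at least s properties.

```python
from itertools import combinations


def intersection_graph(property_sets, s=1):
    """Edges (i, j), i < j, between objects sharing at least s properties."""
    edges = set()
    for i, j in combinations(range(len(property_sets)), 2):
        if len(property_sets[i] & property_sets[j]) >= s:
            edges.add((i, j))
    return edges


def isolated_vertices(n, edges):
    """Objects that are similar to no other object."""
    touched = {v for edge in edges for v in edge}
    return [v for v in range(n) if v not in touched]


# Four hypothetical objects described by properties a..d.
objects = [{"a", "b"}, {"b", "c"}, {"d"}, {"a", "b", "c"}]
for s in (1, 2, 3):
    edges = intersection_graph(objects, s)
    print(s, sorted(edges), isolated_vertices(len(objects), edges))
```

Increasing s makes the similarity requirement stricter, so edges disappear and more objects become isolated.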

Then an effective model to statistically analyze the structure of similarities 
between objects is the random intersection graph $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$ with vertex set V 
generated by the random bipartite graph $\mathcal{BG}(n, m, \mathcal{P}_{(m)})$ with bipartition (V, W). 
In $\mathcal{BG}(n, m, \mathcal{P}_{(m)})$ the number $d_i$ of neighbors (properties) of a vertex $v_i \in V$ 
(object) is assigned according to the probability distribution $\mathcal{P}_{(m)}$ and the actual set 
of properties is then taken uniformly from all $d_i$-element subsets of W. Moreover, 
in $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$, an edge connects $v_1$ and $v_2$ from V if and only if they have at least 
s common neighbors in $\mathcal{BG}(n, m, \mathcal{P}_{(m)})$. The model, called the active intersection 
graph, was introduced in Godehardt and Jaworski (2002) as a model for classification, 
especially for finding cluster structures in non-metric data sets. Under the hypothesis 
of homogeneity in a data set, the bipartite graph and the related intersection 
graph can be interpreted as the realization of a random bipartite graph together 
with its active random intersection graph (for more information on how graph- 
theoretical concepts can be used in defining cluster models, revealing clusters in 
a data set, and testing their randomness see Bock, 1996; Godehardt, 1990 for metric 
data, and Godehardt & Jaworski, 2002; Godehardt, Jaworski, & Rybarczyk, 2007 
for non-metric data). 

Our main purpose in this paper is to study the number of isolated vertices 
(objects similar to no other) in $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$. Previous results concerning this 
problem considered only the case where each object had the same number of properties 
(the degenerate distribution) or each vertex had a binomially distributed 
number of properties and s = 1 (see Karoński, Scheinerman, & Singer-Cohen, 
1999; Bloznelis, Jaworski, & Rybarczyk, 2009; Godehardt et al., 2007). In our new 
approach we treat dependencies between the appearances of edges for s > 1, which 
is important from the application point of view, and we give results for the case 
where the number of properties may vary for different objects (according to the 
distribution $\mathcal{P}_{(m)}$). 



2 Definitions and Main Results 



Definition 1. Let $V = \{v_1, \ldots, v_n\}$ and $W = \{w_1, \ldots, w_{m(n)}\}$ be disjoint sets, s be 
a positive integer and $\mathcal{P}_{(m)} = (P_0, P_1, \ldots, P_m)$ be a probability distribution. Moreover, 
let $D(v_1), \ldots, D(v_n)$ be a family of random subsets of W generated according 
to the following procedure. Independently for all $1 \le i \le n$: 

1. First $Z_i$, the cardinality of a set of properties $D(v_i)$, is assigned to $v_i$ according 
to the probability distribution $\mathcal{P}_{(m)}$ (i.e., $\Pr\{Z_i = d\} = P_d$ for all $0 \le d \le m$). 

2. Then given $Z_i = d$, a set of d properties is assigned to $v_i$ uniformly over the 
class of all d-element subsets of W, i.e., for a given d-element subset $A \subseteq W$, 
$\Pr\{D(v_i) = A \mid Z_i = d\} = \binom{m}{d}^{-1}$. 

A random intersection graph $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$ is a graph with vertex set V and edge 
set $E = \{\{v_i, v_j\} : |D(v_i) \cap D(v_j)| \ge s\}$. 

In the following considerations we will assume that s is an arbitrary positive integer. 
Additionally, $D(v_i)$ will be the individual set of properties of a vertex $v_i$. According 
to the definition, $Z_n(v_i) = |D(v_i)|$ is a random variable with the distribution $\mathcal{P}_{(m)}$. 
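
The two-step procedure of Definition 1 translates directly into a small simulation. The sketch below is only an illustration: the function names, the representation of W as the integers 0, ..., m - 1 and the example distribution are assumptions, and no output is shown because the draws depend on the seed.

```python
import random


def sample_property_sets(n, m, size_probs, rng):
    """Draw D(v_1), ..., D(v_n) following the two-step procedure.

    size_probs[d] plays the role of P_d for d = 0..m; rng is a random.Random.
    Step 1 draws the cardinality Z_i, step 2 draws a uniform Z_i-element subset.
    """
    if len(size_probs) != m + 1:
        raise ValueError("size_probs must have m + 1 entries")
    sizes = rng.choices(range(m + 1), weights=size_probs, k=n)
    return [set(rng.sample(range(m), d)) for d in sizes]


def count_isolated(property_sets, s):
    """Number of vertices sharing fewer than s properties with every other vertex."""
    n = len(property_sets)
    isolated = 0
    for i in range(n):
        if all(len(property_sets[i] & property_sets[j]) < s
               for j in range(n) if j != i):
            isolated += 1
    return isolated


rng = random.Random(2008)
# Hypothetical distribution on {0, ..., 5}: every object has 1, 2 or 3 properties.
sets = sample_property_sets(n=20, m=5, size_probs=[0, 1, 2, 1, 0, 0], rng=rng)
print(count_isolated(sets, s=2))
```

Repeating the last two lines over many seeds gives an empirical distribution of the number of isolated vertices that can be compared with a Poisson law.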

Theorem 1. Let $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$ be a random intersection graph. Moreover, let 



s\ 



for a constant c and 

i^n^s (.tbn)s ^ 

(d„)sCo{n) 

where the random variables $U_n$ tend in distribution to U, and $U_n$ and U have values 
in finite sets of the same cardinality: 

1. If $\omega(n) \ln n = o(1)$, then the number of isolated vertices in $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$ tends 
in distribution to a Poisson distribution $Po(\lambda)$ with $\lambda = e^{-c}$. 

2. If $\omega(n) \ln n = 1$, then the number of isolated vertices tends in distribution to a 
Poisson distribution $Po(\lambda)$ with $\lambda = \ldots$ 

3. If $\omega(n) \ln n \to \infty$, E(U) = 0 and U has at least two values, then with high 
probability $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$ has at least one isolated vertex. 

Since the proof of the main theorem is technical, we have divided it into several 
propositions and lemmas. In Sect. 3 we will give all those statements. In Sect. 4 
we will give the proof of the main theorem. In Sect. 5 we will prove the following 
corollaries related to the cases where $\mathcal{P}_{(m)}$ is the degenerate or binomial 
distribution. 

Corollary 1. Let $\mathcal{P}_{(m)} = (P_1, \ldots, P_m)$ be such that $P_{d_n} = 1$. If 

$$\frac{n\,(d_n)_s^2}{s!\, m^s} - \ln n \to c$$

for a constant c, then the number of isolated vertices in $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$ tends in 
distribution to a Poisson distribution $Po(e^{-c})$. 



Corollary 2. Let $\mathcal{P}_{(m)} = (P_1, \ldots, P_m)$ be a binomial distribution Bin(m, p). If 
$m = n^{\delta}$ for $\delta > 1/s$ and 



n{\inp\)] 

Inn 

/n' 



( 1 ) 



for a constant c, then the number of isolated vertices in $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$ tends in 
distribution to a Poisson distribution $Po(e^{-c})$. 
3 Preliminaries 

In the proofs a random intersection graph with given cardinalities of the property 
sets which are assigned to the vertices will be useful. 

Definition 2. Let $V = \{v_1, \ldots, v_n\}$ and $W = \{w_1, \ldots, w_{m(n)}\}$ be sets and $s$ a
positive integer. Given a vector $d = d(n) = (d_1(n), d_2(n), \ldots, d_n(n))$ of nonnegative
integers, let $D(v_1), \ldots, D(v_n)$ be a family of random subsets of $W$ generated
according to the following procedure.

Independently for all $1 \le i \le n$, a set of properties $D(v_i)$ is assigned to $v_i$
uniformly over the class of all $d_i$-element subsets of $W$, i.e., for a given $d_i$-element
subset $A \subseteq W$, $\Pr\{D(v_i) = A\} = \binom{m}{d_i}^{-1}$.

A random intersection graph $\mathcal{G}_s(n, m, d)$ is a graph with the vertex set $V$ and the
edge set $E = \{\{v_i, v_j\} : |D(v_i) \cap D(v_j)| \ge s\}$.
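The construction in Definition 2 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sample_intersection_graph` and `isolated_vertices` are hypothetical, properties are represented by the integers 0, ..., m-1, and vertices by 0, ..., n-1.

```python
import random
from itertools import combinations


def sample_intersection_graph(n, m, d, s, seed=None):
    """Sample G_s(n, m, d): vertex i receives a uniformly random d[i]-element
    subset of the m properties; i and j are joined if they share >= s of them."""
    rng = random.Random(seed)
    props = [frozenset(rng.sample(range(m), d[i])) for i in range(n)]
    edges = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if len(props[i] & props[j]) >= s
    }
    return props, edges


def isolated_vertices(n, edges):
    """Return the vertices that are not an endpoint of any edge."""
    touched = {v for edge in edges for v in edge}
    return [v for v in range(n) if v not in touched]
```

A vertex with fewer than s properties can never be joined to another vertex, which is why such vertices are always isolated in the sampled graph.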

Notice that

$$\Pr\{\mathcal{G}_s(n, m, d) = G\} = \Pr\{\mathcal{G}_s(n, m, \mathcal{P}_{(m)}) = G \mid (Z_1, \ldots, Z_n) = d\}. \qquad (2)$$

Our preliminary results will concern $\mathcal{G}_s(n, m, d)$. By $A_{ij}$ we will denote the event
that two vertices $v_i$ and $v_j$ are joined by an edge in $\mathcal{G}_s(n, m, d)$. We will assume
that $s \le d_i$ for all $i$, since vertices with fewer properties are always isolated. $D =
\max\{|D(v)| : v \in V\}$ will be the maximum possible number of properties of a
vertex. Generally we assume that $D^2/m \to 0$. Moreover in all notations $O(\cdot)$ and
$o(\cdot)$ will be uniformly bounded over all possible vectors $d$ such that $\max\{d_i : 1 \le
i \le n\} \le D$.



3.1 Edge Probability 

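Proposition 1. Let $d_1, d_2 \ge s \ge 1$. Assume that $D^2/m \to 0$ as $n \to \infty$. The edge
probability in $\mathcal{G}_s(n, m, d)$ satisfies, as $n \to \infty$,

$$\Pr\{A_{12}\} = \frac{(d_1)_s\,(d_2)_s}{s!\,m^s}\left(1 + O\!\left(\frac{D^2}{m}\right)\right).$$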



Proof. Let $\mathcal{A}$ be the family of subsets of $W$ with exactly $d_2$ elements that intersect
a given $D(v_1) \subseteq W$ in at least $s$ elements. Then





Therefore after standard calculations we obtain the thesis. 
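The quality of the approximation can be checked numerically. The sketch below assumes that $(d)_s$ denotes the falling factorial $d(d-1)\cdots(d-s+1)$ and that the leading term of the edge probability is $(d_1)_s (d_2)_s/(s!\,m^s)$; the exact probability is the hypergeometric tail for two independent uniform subsets. The function names are hypothetical.

```python
from math import comb, factorial


def falling(d, s):
    """Falling factorial d (d - 1) ... (d - s + 1)."""
    out = 1
    for r in range(s):
        out *= d - r
    return out


def edge_prob_exact(m, d1, d2, s):
    """Pr{|D1 ∩ D2| >= s} for a fixed d1-subset D1 and a uniform d2-subset D2 of m items."""
    favourable = sum(
        comb(d1, t) * comb(m - d1, d2 - t) for t in range(s, min(d1, d2) + 1)
    )
    return favourable / comb(m, d2)


def edge_prob_approx(m, d1, d2, s):
    """Leading-order term (d1)_s (d2)_s / (s! m^s)."""
    return falling(d1, s) * falling(d2, s) / (factorial(s) * m ** s)
```

For small $D^2/m$ the two values agree closely; for example, with m = 10000, d1 = d2 = 5 and s = 2 the relative difference is well below one percent.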
3.2 Isolated Vertices in $\mathcal{G}_s(n, m, d)$

In this section all the results will concern $\mathcal{G}_s(n, m, d)$, which means that from now
on we assume that $d_i = |D(v_i)|$ for all $i \in \{1, \ldots, n\} = [n]$ are given.

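Proposition 2. Let $k$ be a positive integer and $I = \{i_1, \ldots, i_k\} \subseteq [n]$ be a subset
of indices. The probability that the vertices $\{v_{i_1}, \ldots, v_{i_k}\}$ are isolated in $\mathcal{G}_s(n, m, d)$
equals

$$\Pr\Big\{\bigcap_{i \ne i' \in I} A_{ii'}^c \cap \bigcap_{j \in [n] \setminus I}\,\bigcap_{i \in I} A_{ij}^c\Big\}
= \exp\big(-O(D^2/m)\big) \prod_{j \in [n] \setminus I} \exp\Big(-\sum_{i \in I} \Pr\{A_{ij}\}\big(1 + O(D^2/m)\big)\Big).$$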

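Proof. Assume that $k$ is a given positive integer. First, given $d$, we will estimate
the probability that no two vertices from the set $\{v_1, \ldots, v_k\}$ are joined by an edge
in $\mathcal{G}_s(n, m, d)$. Note that for given distinct indices $i_1, i_2, i_3, i_4$, the events $A_{i_1 i_2}$ and
$A_{i_3 i_4}$ are pairwise independent, as well as the events $A_{i_1 i_2}$ and $A_{i_2 i_3}$. Therefore

$$\sum_{1 \le i < i' \le k} \Pr\{A_{ii'}\} \;\ge\; \Pr\Big\{\bigcup_{1 \le i < i' \le k} A_{ii'}\Big\}
\;\ge\; \sum_{1 \le i < i' \le k} \Pr\{A_{ii'}\} - \sum_{\substack{1 \le i_1 < j_1 \le k \\ 1 \le i_2 < j_2 \le k \\ (i_1, j_1) \ne (i_2, j_2)}} \Pr\{A_{i_1 j_1} \cap A_{i_2 j_2}\}
\;\ge\; \Big(1 - \sum_{1 \le i < i' \le k} \Pr\{A_{ii'}\}\Big) \sum_{1 \le i < i' \le k} \Pr\{A_{ii'}\}.$$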

Hence Proposition 1 implies that

$$\Pr\Big\{\bigcap_{1 \le i < i' \le k} A_{ii'}^c\Big\} = 1 - \Pr\Big\{\bigcup_{1 \le i < i' \le k} A_{ii'}\Big\} = 1 - O\big(k^2 D^{2s}/m^s\big). \qquad (3)$$

On the other hand we have 



Pr 



Now we will give estimates on the probability that, given that $\{v_1, \ldots, v_k\}$ is an
independent set, it consists of isolated vertices in $\mathcal{G}_s(n, m, d)$. Let $\mathcal{D}$ be the family of
all sequences (with repetitions) of sets $(D_1, \ldots, D_k)$ such that $|D_i| = d_i$, $D_i \subseteq W$
and $|D_i \cap D_{i'}| \le s - 1$ for all $i \ne i'$, $1 \le i \le k$, $1 \le i' \le k$. For given $\mathbf{D} \in \mathcal{D}$ let

$$\Pr\nolimits_{\mathbf{D}}\{\cdot\} = \Pr\{\cdot \mid \mathbf{D} = (D(v_1), \ldots, D(v_k))\}.$$
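For given $j$, $k < j \le n$, we want to estimate $\Pr_{\mathbf{D}}\{\bigcap_{1 \le i \le k} A_{ij}^c\}$. Note that

$$\Pr\nolimits_{\mathbf{D}}\Big\{\bigcup_{1 \le i \le k} A_{ij}\Big\} \ge \sum_{i=1}^{k} \Pr\nolimits_{\mathbf{D}}\{A_{ij}\} - \sum_{1 \le i < i' \le k} \Pr\nolimits_{\mathbf{D}}\{A_{ij} \cap A_{i'j}\}$$

and for $j > k$

$$\Pr\nolimits_{\mathbf{D}}\{A_{ij}\} = \Pr\{A_{ij}\}.$$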



Let be the event that vertices v,j,v ,-2 intersect on exactly t properties, 

t < — 1. Then for any D which implies we obtain 

ProlAyy Ayy} = PrD{Ai; H Ay; I ^iiyCO} 

^ (di\(di^-t\(m-s-l\ / (m\ 

-\s)\ 1 )\dj-s-l) / \dj) 

< (diMdjh ^ ^ 



m 



s\m^ 



< -ll*!W^(i + 0(DV».))~-Pr{A,j). 

m s\m^ m 

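where $O(\cdot)$ is uniformly bounded over all possible $\mathbf{D}$. Hence

$$\sum_{i=1}^{k} \Pr\{A_{ij}\} \ge \Pr\nolimits_{\mathbf{D}}\Big\{\bigcup_{1 \le i \le k} A_{ij}\Big\} \ge \big(1 + O(kD^2/m)\big) \sum_{i=1}^{k} \Pr\{A_{ij}\}.$$

Thus

$$\Pr\nolimits_{\mathbf{D}}\Big\{\bigcap_{1 \le i \le k} A_{ij}^c\Big\} = 1 - \big(1 + O(kD^2/m)\big) \sum_{i=1}^{k} \Pr\{A_{ij}\}.$$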



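Obviously $\bigcap_{1 \le i < i' \le k} A_{ii'}^c = \bigcup_{\mathbf{D} \in \mathcal{D}} \{\mathbf{D} = (D(v_1), \ldots, D(v_k))\}$. Also, given $\mathbf{D}$, the
events $\bigcap_{1 \le i \le k} A_{i j_1}^c, \ldots, \bigcap_{1 \le i \le k} A_{i j_t}^c$ are independent for $j_1, \ldots, j_t > k$. Hence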



p'j n n + n 

k<j<n\<i<k \<i<i'<k 



= Ep'»j n n n 

Dec k<j<nl<i<k \<i<i'<k 

= E n p'»| n n 



B)€T>k<j<n ^<i<k 



l<i<i'<k 
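which implies that in $\mathcal{G}_s(n, m, d)$

$$\Pr\Big\{\bigcap_{k < j \le n}\,\bigcap_{1 \le i \le k} A_{ij}^c \;\Big|\; \bigcap_{1 \le i < i' \le k} A_{ii'}^c\Big\}
= \prod_{k < j \le n} \Big(1 - \big(1 + O(kD^2/m)\big) \sum_{i=1}^{k} \Pr\{A_{ij}\}\Big). \qquad (4)$$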



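Notice that $\Pr\{A_{ij}\} = O(D^2/m)$ is uniformly bounded over all $i, j$. Thus by (3)
and (4)

$$\Pr\Big\{\bigcap_{k < j \le n}\,\bigcap_{1 \le i \le k} A_{ij}^c \cap \bigcap_{1 \le i < i' \le k} A_{ii'}^c\Big\}
= \exp\big(-O(D^2/m)\big) \prod_{k < j \le n} \exp\Big(-\sum_{i=1}^{k} \Pr\{A_{ij}\}\big(1 + O(D^2/m)\big)\Big).$$

Obviously the same follows for any $k$-element subset of indices (after relabeling the
vertices), and therefore the proof is completed.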

Notice that since $D^{2s} = O(m^s \ln n / n)$, we have $(d_i)_s (d_j)_s/(s!\,m^s) \le D^{2s}/(s!\,m^s) =
O(\ln n / n)$ and $D^2/m = o(1/\ln n) = o(1)$. It follows from Proposition 1 that
$\Pr\{A_{ij}\} = O(D^{2s}/m^s)$ and thus Proposition 2 implies the following result.

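Proposition 3. Let $k$ be a positive integer, $D^{2s} = (D(n))^{2s} = O(m^s \ln n / n)$ and
$I = \{i_1, \ldots, i_k\} \subseteq \{1, \ldots, n\} = [n]$ be a subset of indices. For any $d$ such that
$\max\{d_i : 1 \le i \le n\} \le D$, the probability that the vertices $\{v_{i_1}, \ldots, v_{i_k}\}$ are
isolated in $\mathcal{G}_s(n, m, d)$ is given by

$$\Pr\Big\{\bigcap_{i \ne i' \in I} A_{ii'}^c \cap \bigcap_{j \in [n] \setminus I}\,\bigcap_{i \in I} A_{ij}^c\Big\}
= \exp(o(1)) \exp\Big(-\sum_{j \in [n] \setminus I}\,\sum_{i \in I} \frac{(d_i)_s (d_j)_s}{s!\,m^s}\Big),$$

where $o(1)$ is uniform over all $d$ such that $\max\{d_i : 1 \le i \le n\} \le D$.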
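As an illustration (a sketch under stated assumptions, not part of the original argument), consider the special case $k = 1$ in which all $d_i$ equal a common value $d_n$. Reading the exponent in Proposition 3 as the sum of the leading-order edge probabilities $(d_i)_s (d_j)_s/(s!\,m^s)$, and assuming the normalization $n (d_n)_s^2/(s!\,m^s) = \ln n + c_n$ with $c_n \to c$, the expected number of isolated vertices $Y$ behaves as follows:

```latex
\begin{align*}
\Pr\{v_i \text{ is isolated}\}
  &= \exp(o(1)) \exp\Big(-\sum_{j \ne i} \frac{(d_n)_s^2}{s!\,m^s}\Big)
   = \exp(o(1)) \exp\Big(-\frac{n-1}{n}\,(\ln n + c_n)\Big) \\
  &= \frac{1 + o(1)}{n}\, e^{-c_n},
\qquad \text{so} \qquad
\mathrm{E}\,Y = \sum_{i=1}^{n} \Pr\{v_i \text{ is isolated}\} \to e^{-c}.
\end{align*}
```

This matches the Poisson parameter $e^{-c}$ that appears for the degenerate distribution in Corollary 1.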

Lemma 1. Let M be a positive integer, f = (fi{n), . . . ,fM(n)) e and~p = 

{p\(n), . . . , pM(n)) G [0, 1]^ be sequences of vectors. If d = {d\, . . . , is such 
that di G {f\{n), . . . , fM(n)} for all 1 < t < n, = O Vi = 

{i G [n] : di = fi{n)} and pi{n)n{\ - < |V/| < pi{n)n{\ + for given 

8{n) and 1 < / < M, then 



pr{ n n n 

\<i<i'<k k<j<n\<i<k 



= exp({?(l) + 0(6{n))) 



( M M 

-E ki' ^ Pi(n) 

i'=i i=i 



nWiUMs 

s\m^ 
Proof. Let D — and ki = |V/ fl 7|. Since |/| = k

is a given positive integer and n(D)j /{s\m^) = O(lnn), we obtain from Propo- 
sition 3 



p-l n n n 

l<i<i'<k k<j<n\<i<k 



= exp(o(l))exp 




(Pl)s{.Ms 

sbn^ 



and the statement of Lemma 1 follows. 

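Let $Y(d) = Y(d, n)$ be a random variable counting the isolated vertices in
$\mathcal{G}_s(n, m, d)$. Then

$$Y(d) = \sum_{i=1}^{n} I_i, \quad \text{where} \quad
I_i = \begin{cases} 1 & \text{if } v_i \text{ is isolated in } \mathcal{G}_s(n, m, d), \\ 0 & \text{if } v_i \text{ is not isolated in } \mathcal{G}_s(n, m, d) \end{cases} \qquad (5)$$

and

$$\mathrm{E}(Y(d))_k = k! \sum_{i_1 < \cdots < i_k} \mathrm{E}(I_{i_1} \cdot \ldots \cdot I_{i_k})
= k! \sum_{I \subseteq [n],\, |I| = k} \Pr\{\text{all vertices } v_i,\ i \in I, \text{ are isolated in } \mathcal{G}_s(n, m, d)\}.$$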



Let $N(n) = (N_1(n), \ldots, N_M(n))$ be a family of vectors of integers such that
$\sum_{l=1}^{M} N_l(n) = n$. We say that $d$ is a realization of $N(n)$ ($d \triangleleft N(n)$) if for $d$,
$|V_l| = N_l(n)$ for all $1 \le l \le M$ (where $V_l$ is defined as in the statement of
Lemma 1). Notice that for $d \triangleleft N(n)$ and $d' \triangleleft N(n)$, since $d$ and $d'$ differ only by
order of entries, we have

$$\mathrm{E}(Y(d))_k = \mathrm{E}(Y(d'))_k = \mathrm{E}(Y(d_N))_k,$$

where $d_N$ is such that $d_i = f_l(n)$ for all $(N_1 + \cdots + N_l) - N_l < i \le N_1 + \cdots + N_l$
and $1 \le l \le M$.

Lemma 2. Let M be a positive integer, f = (f\(n), . . . , fuin)) e and ~p = 
{pi(n), . . . , pM(n)) e [0, 1]“ be sequences of vectors. Given any N(n) = (Ni(n), 

. . . , NM(n)) such that 



pi{n)n{l - < Ni(n) < pi(n)n{\ + for all I < I < M, 



(6) 
we have



^{Y(dt,))k = exp (0(1) + 0{&(n))) {A{p{n), P(n)))’^ ^ 



( 7 ) 



where 



Proof. To prove the above lemma one can note that 





and check that uniformly over all choices of ki we have 




/i = 1 




4 Proof of Theorem 1 

In $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$, let $Z_n \in \{\beta_1(n), \ldots, \beta_M(n)\}$ with probability 1. Moreover let
$p_l(n) = \Pr\{Z_n = \beta_l(n)\}$, $p = (p_1(n), \ldots, p_M(n))$ and assume that the sequences
$(p_l(n))_{n \ge 1}$ are bounded away from zero by a constant. Let $N$ be a random vector
$(N_1, \ldots, N_M)$, where $N_l = |\{i \in [n] : |D(v_i)| = \beta_l(n)\}|$, for $1 \le l \le M$. Set

8(n) — ^ 0 such that ^ In^ n / n = o(8(n)). Then from the Chernoff bound with high 
probability, 



for all $1 \le l \le M$. Let $B_\delta(n)$ be the set of vectors $N$ such that (6) is fulfilled.

In a very similar way as in (5) we may define $Y$, the random variable counting
the isolated vertices in $\mathcal{G}_s(n, m, \mathcal{P}_{(m)})$, as a sum of indicators. From (2), it follows
that $\mathrm{E}(Y(d))_k = \mathrm{E}((Y)_k \mid (Z_1, \ldots, Z_n) = d)$. Hence (7) and (8) imply




( 8 ) 




N d<iN 



J2^(Y{dN))kPT{n = N}+ ^E(T(d^)),Pr{N= IV} ( 9 ) 






N^Bsm 



= exp(o(l))(^(/i(«),j6(«)))'^ + o(l) ~ (A(p{n), . 
Let $n (d_n)_s^2 / (s!\,m^s) = \ln n + c_n$ with $c_n \to c$ for a constant $c$, and let
$((Z_n)_s - (d_n)_s) / ((d_n)_s\,\omega(n)) = U_n$ for $\omega(n) > 0$ and $U_n$ tending in distribution
to $U$. Assume that the random variables $U_n$, for all $n$, and $U$ have values in the
finite sets $\{a_1(n), \ldots, a_M(n)\} \subset \mathbb{R}$ and $\{a'_1, \ldots, a'_M\} \subset \mathbb{R}$, respectively. Moreover
assume that $\Pr\{U = a'_l\} > 0$ for all $1 \le l \le M$. Let $(\beta_l(n))_s = (d_n)_s (a_l\,\omega(n) + 1)$
for all $1 \le l \le M$. Thus from convergence in distribution (if $a_l < a_{l+1}$ and
$a'_l < a'_{l+1}$ for all $l$) we have $p_l(n) \to \Pr\{U = a'_l\}$ (so $p_l(n)$ is bounded away from
zero for large $n$), $a_l(n) \to a'_l$, $\mathrm{E}U_n \to \mathrm{E}U$ and $\mathrm{E}\exp(-U_n) \to \mathrm{E}\exp(-U)$. Then
$(Z_n)_s = (d_n)_s (U_n\,\omega(n) + 1)$, $(D)_s^2 = O((d_n)_s^2) = O(\ln n \; m^s / n)$ and

M 

A{pin),P(n)) = ^ /i,,(72)exp(5(/i)) , 
ii=i 

where for a given l\ and co{n) = C„ / Inn and n /{s\m^) = Inn -H c„, we 
have 



Bih) 



= l+ln» 

h=\ 



sUn^ 



■ (In n -I- c„ ) 1-1 — ; h 1 M- In n . 

\ Inn J \ inn J 



Now we have to consider three cases:

Case I: If $C_n = o(1)$, then $B(l_1) = o(1) - c_n$ and

$$A(p(n), \beta(n)) = \sum_{l_1 = 1}^{M} p_{l_1}(n) \exp(o(1) - c_n) = (1 + o(1)) \exp(-c_n).$$

Case II: If $C_n = 1$, then $B(l_1) = o(1) - a_{l_1} - \mathrm{E}U_n - c_n$ and

$$A(p(n), \beta(n)) = \sum_{l_1 = 1}^{M} p_{l_1}(n) \exp(o(1) - a_{l_1} - \mathrm{E}U_n - c_n)
= (1 + o(1)) \exp(-c_n - \mathrm{E}U_n)\, \mathrm{E}(\exp(-U_n)).$$

Case III: If $C_n \to \infty$, $\mathrm{E}U = 0$ and $U$ has at least two values, then for all $a_l < 0$,
$B(l) \to \infty$, while for all $a_l > 0$, $B(l) \to -\infty$. Therefore

$$A(p(n), \beta(n)) \to \infty.$$

In Cases I and II, Theorem 1 follows from (9) and the method of moments. In Case III,
it is the consequence of (9) for $k = 1$ and $2$ and the second moment method.
5 Remarks on Other Distributions

Proof (of Corollary 1). For the distribution with point mass in one point, $\Pr\{U_n =
0\} = 1$ and for $\omega(n) = o(1/\ln n)$, Theorem 1 implies the result.

Proof (of Corollary 2). For the binomial distribution it follows from (1), that 

inp m^lnn jn . 

If m = for S > \/s, then from the Chernoff bound for = o(l /Inn) and 
= o{^„) with probability 1 — o(l/ n), 

d- = [mp(l - t,n)\ < Z„ < \mp(l + t,nf\ = d+ 

Since (d-)l = (\jnp\)l — o{\ /Inn) and (d+)l = (\rnp\)l + o(l /Inn), thus 
Corollary 2 follows from Corollary 1 and Lemma 4 in Bloznelis et al. (2009). 
Strengths and Weaknesses of Ant Colony
Clustering 



Lutz Herrmann and Alfred Ultsch 



Abstract Ant colony clustering (ACC) is a promising nature-inspired technique
where stochastic agents perform the task of clustering high-dimensional data on a
low-dimensional output space. Most ACC methods are derivatives of the approach
proposed by Lumer and Faieta. These methods usually perform poorly in terms of
topographic mapping and cluster formation, in particular when compared to
clustering on Emergent Self-Organizing Maps (ESOM).

In order to address this issue, a unifying representation for ACC methods and
Emergent Self-Organizing Maps is derived in a brief yet formal manner. ACC
terms are related to corresponding mechanisms of the Self-Organizing Map. This
leads to insights on both algorithms. ACC methods are considered as first-degree
relatives of the ESOM. This explains benefits and shortcomings of ACC and ESOM.
Furthermore, the proposed unification makes it possible to judge whether
modifications improve an algorithm’s clustering abilities or not. This is
demonstrated using a set of cardinal clustering problems.

Keywords Clustering · Emergent self-organizing maps · Swarm intelligence.



1 Introduction 

Flocking behaviour of social insects has inspired various algorithms in numerous
research papers over the last decade due to the ability of simple interacting entities
to exhibit sophisticated self-organization abilities. A particularly interesting field
of application is cluster analysis, i.e., the retrieval of groups of similar objects in
high-dimensional spaces. The idea behind Ant Colony Clustering (ACC) is that
autonomous stochastic agents, called ants, move data objects on a low-dimensional
regular grid such that similar objects are more likely to be placed on nearby grid
nodes than dissimilar ones. This task is referred to as topographic mapping.

Most popular ACC methods are based on the algorithm proposed by Lumer
and Faieta (1994). The most advanced derivative might be ATTA (Adaptive Time
Dependent Transporter Ants; Handl, Knowles, & Dorigo, 2006). ACC methods
are known for at least two flaws: results are highly dependent on parametrization
(Aranha & Iba, 2006), and even ATTA has been found to be “not competitive to
the established methods of Multi-dimensional Scaling or Self-Organizing Maps”
(Handl et al., 2006) in terms of topographic mapping.

In the following sections, the basic ACC algorithm by Lumer and Faieta is introduced
in a notation consistent with the well-known Batch-SOM. A unifying representation
for both methods is then derived in Sect. 3. Sections 4 and 5 describe how to
improve topographic mappings of ACC methods on the basis of Batch-SOM. Finally,
in Sect. 6 the effect of altered objective functions is empirically verified.



2 Ant Colony Clustering 

The ACC method proposed by Lumer and Faieta (1994) operates on a fixed regular
low-dimensional grid $G \subset \mathbb{N}^2$. A finite set of input samples $X$ from a vector space
with norm $\|\cdot\|$ is projected onto the grid by $m : X \to G$. The mapping $m$ is altered
by autonomous stochastic agents, called ants, that move input samples $x \in X$ from
$m(x)$ to a new location $m'(x)$. Ants move randomly on neighbouring grid nodes. Ants
might pick input samples when facing occupied nodes and drop input samples when
facing empty nodes. The probabilities for picking input sample $x \in X$ from node
$i = m(x)$ and for dropping the picked $x$ on node $j \in G$ are

$$p_{pick,x}(i) = \left(\frac{k_1}{k_1 + \phi_x(i)}\right)^2, \qquad
p_{drop,x}(j) = \left(\frac{\phi_x(j)}{k_2 + \phi_x(j)}\right)^2,$$

respectively. Here, $k_1, k_2 \in \mathbb{R}^+$ are threshold constants.
$\phi_x(i)$ denotes the average similarity between $x \in X$ and input samples located
on the so-called perceptive neighbourhood. Usually, the perceptive neighbourhood
consists of $\sigma^2 \in \{9, 25\}$ quadratically arranged nodes, with the ant located
in the center. The set of input samples mapped onto the perceptive neighbourhood
around $i \in G$ is denoted by $N_x(i) = \{y \in X : y \ne x,\ m(y) \text{ neighbouring } i\}$. In
this context, $\phi$ is referred to as objective function since its minimization determines
the ants’ probabilistic modifications of mapping $m : X \to G$:

$$\phi_x(i) = \frac{1}{\sigma^2} \sum_{y \in N_x(i)} \left(1 - \frac{\|x - y\|}{\alpha}\right). \qquad (1)$$

ACC methods lead to a local sorting of input samples on the grid in terms of
similarities. Ants gather scattered input samples into dense piles. In the literature, it has
been noticed that ACC derivatives are prone to produce too many and too small
clusters (Aranha & Iba, 2006; Handl et al., 2006). For illustration see Fig. 1.
Fig. 1 Typical result of ACC methods. From left to right: Gaussian data with four clusters, initial
mapping of data objects, dense clusters appear, too many clusters with topological defects have
finally emerged (Aranha & Iba, 2006)
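The pick and drop rules of Sect. 2 can be illustrated with a minimal Python sketch. It assumes the commonly cited squared-ratio form of the pick and drop probabilities and an average similarity $\phi$ that sums $1 - \|x - y\|/\alpha$ over the perceptive neighbourhood and divides by its size $\sigma^2$; negative similarities are clamped to zero so that the ratios stay in [0, 1], which is an assumption of this sketch. The function names are hypothetical.

```python
import math


def phi(x, neighbours, alpha, sigma2):
    """Average similarity of sample x to the samples in the perceptive neighbourhood."""
    return sum(1.0 - math.dist(x, y) / alpha for y in neighbours) / sigma2


def p_pick(phi_value, k1):
    """Probability of picking up a sample; high when the sample fits its surroundings badly."""
    f = max(0.0, phi_value)  # assumption: clamp negative similarity to zero
    return (k1 / (k1 + f)) ** 2


def p_drop(phi_value, k2):
    """Probability of dropping a carried sample; high when it fits the surroundings well."""
    f = max(0.0, phi_value)  # assumption: clamp negative similarity to zero
    return (f / (k2 + f)) ** 2
```

With these rules a sample surrounded by similar samples is rarely picked up and readily dropped, which is what produces the dense piles described above.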



3 Analysis of Ant Colony Clustering by Means
of Self-Organizing Batch Maps



In order to compare Self-Organizing Maps (SOM) and Ant Colony Clustering
(ACC), a unifying basis for both algorithms is derived. Input data $X$ and output
grid $G \subset \mathbb{N}^2$ are identical, and the mapping function $m : X \to G$ is iteratively updated
in both cases as well.

Self-Organizing Batch Maps (Batch-SOM) are well-known artificial neural networks
that consist of grid $G$, codebook vectors $w_i \in \mathbb{R}^n$, $i \in G$, and a mapping
function $m : X \to G$ with $m(x) = \arg\min_{i \in G} \|x - w_i\|$. The codebook vectors
are defined according to (2), in which $h : G \times G \to [0, 1]$ denotes a time-dependent
neighbourhood function. An update of $m : X \to G$ leads to an update of codebook
vectors $w_i$, $i \in G$, and vice versa. This is how the Batch-SOM modifies mapping
$m : X \to G$. For details see Kohonen (1995).

In the literature (Ultsch & Mörchen, 2006), two main types of Self-Organizing Maps
(SOM) can be distinguished. In the first, each codebook vector represents
a single cluster of input samples. In the second, SOM are used as tools
for visualization of structural features of the input space, and a single codebook vector
is meaningless. A characteristic of this paradigm is the large number of codebook
vectors, usually several thousands (>4,000). These SOM are referred to as Emergent
Self-Organizing Maps (ESOM). For details see Ultsch and Mörchen (2006).

$$w_i = \frac{\sum_{x \in X} h(m(x), i)\, x}{\sum_{x \in X} h(m(x), i)}. \qquad (2)$$

A meaningful objective function for the Batch-SOM is derived from the quantization
error $\|x - w_i\|$ because its minimization determines the update of $m : X \to G$.
Resolving the quantization error with (2) leads to objective function $\Phi$ of the
Batch-SOM [see (3)]. $\Phi$ represents the norm of averaged differences $x - y$ over
grid-neighbouring input samples $y \in X$:

$$\Phi_x(i) = \left\| \frac{\sum_{y \in X} h(m(y), i)\,(x - y)}{\sum_{y \in X} h(m(y), i)} \right\|. \qquad (3)$$

In the following, the mechanism of picking and dropping ants is no longer subject
of consideration. In Tan, Ting, and Teng (2006) it was shown that collective
intelligence can be discarded in ACC systems, i.e., the same results were achieved without
ants but using objective function $\phi$ directly for probabilistic cluster assignments.
This simplification is evident: over a sufficient period of time, randomly moving
ants may select any arbitrary subset of input samples, but re-allocation through
picking and dropping depends on $\phi$ only. The probability of selection is the same for all
input samples, such that ants might be omitted in favor of any other subset sampling
technique.

A meaningful symmetrical neighbourhood function $h : G \times G \to [0, 1]$ for
ACC methods is defined according to the perceptive neighbourhood of ants, i.e.,
$h(i, j)$ is 1 if $j \in G$ is located in the perceptive neighbourhood of node $i \in G$
and 0 elsewhere. This neighbourhood function allows $\phi$ to be restated as (4) by use of
$|N_x(i)| = \sum_{y \in X} h(m(y), i)$:

$$\phi_x(i) = \frac{|N_x(i)|}{\sigma^2} \left(1 - \frac{\Phi'_x(i)}{\alpha}\right)
\quad \text{with} \quad
\Phi'_x(i) = \frac{\sum_{y \in X} h(m(y), i)\, \|x - y\|}{\sum_{y \in X} h(m(y), i)}. \qquad (4)$$



The ACC error function $\phi = \frac{|N|}{\sigma^2}\left(1 - \frac{\Phi'}{\alpha}\right)$ incorporates $\Phi'$, which is a weighted sum of
local input space distances. Obviously, $\Phi'$ measures the local stress of topographic
mapping $m : X \to G$, comparable to $\Phi$ of the Batch-SOM. $\Phi'$ even acts as an upper
limit to $\Phi$ since $\forall x \in X, i \in G : \Phi_x(i) \le \Phi'_x(i)$. Due to that, $1 - \Phi'/\alpha$ is referred to
as topographic term of ACC algorithms.
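The relation between the two objective functions can be checked numerically. The sketch below assumes that $\Phi$ is the norm of the $h$-weighted mean difference $x - y$, that $\Phi'$ is the $h$-weighted mean distance $\|x - y\|$, and that the ACC objective factors into a density term $|N|/\sigma^2$ times the topographic term $1 - \Phi'/\alpha$. The neighbourhood values $h(m(y), i)$ are passed in as a list of weights, and all function names are hypothetical.

```python
import math


def phi_batch_som(x, samples, weights):
    """Norm of the weighted mean difference x - y (Batch-SOM objective)."""
    total = sum(weights)
    mean_diff = [
        sum(w * (xk - y[k]) for w, y in zip(weights, samples)) / total
        for k, xk in enumerate(x)
    ]
    return math.sqrt(sum(c * c for c in mean_diff))


def phi_prime_acc(x, samples, weights):
    """Weighted mean of the distances ||x - y|| (ACC distortion term)."""
    total = sum(weights)
    return sum(w * math.dist(x, y) for w, y in zip(weights, samples)) / total


def acc_objective_direct(x, samples, weights, alpha, sigma2):
    """ACC objective as a sum of similarities over the perceptive neighbourhood."""
    return sum(
        w * (1.0 - math.dist(x, y) / alpha) for w, y in zip(weights, samples)
    ) / sigma2


def acc_objective_decomposed(x, samples, weights, alpha, sigma2):
    """Same objective as output density term times topographic term."""
    density = sum(weights) / sigma2
    return density * (1.0 - phi_prime_acc(x, samples, weights) / alpha)
```

The bound $\Phi \le \Phi'$ is the triangle inequality applied to the weighted mean: the norm of an average of vectors cannot exceed the average of their norms.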

The term $|N_x(i)|/\sigma^2$ estimates the output space density around grid node $i \in G$.
Therefore, it is referred to as output density term of ACC algorithms. A unifying
framework for analysis and assessment of Batch-SOM and ACC exists by
means of objective functions $\Phi$ and $\phi$. Both functions are denoted by means of
three functions: norm $\|\cdot\|$, neighbourhood $h : G \times G \to [0, 1]$ and mapping
$m : X \to G$.

This leads to the following insights: The ACC method uses a fixed neighbourhood
function with small radius, whereas Batch-SOM uses shrinking neighbourhood
functions with large radii. ACC has a probabilistic update of mapping
$m : X \to G$, whereas Batch-SOM is deterministic. The objective function of ACC
algorithms decomposes into an output density term $|N|/\sigma^2$ and a term $1 - \Phi'/\alpha$ related
to topographic quality. $\Phi'$ is easily identified as a topographic distortion measure
because of its relation to $\Phi$ of Batch-SOM. Therefore, the ACC algorithm is easily
convertible into a special case of Batch-SOM, and vice versa. For a brief overview
of differences see Table 1.
Table 1 Differences of Batch-SOM and Ant Colony Clustering (ACC)

                                        Batch-SOM           ACC
Neighbourhood                           Large, shrinking    Small, fixed
  $h : G \times G \to [0, 1]$
Update of $m : X \to G$                 Deterministic       Probabilistic
Searching for update                    Global, $G$         Local, $\subset G$
  of $m : X \to G$
Objective function                      $\Phi$              $\frac{|N|}{\sigma^2}\left(1 - \frac{\Phi'}{\alpha}\right)$
Termination                             Cooling scheme      Never



4 Improvement of Ant Colony Clustering 

ACC methods are prone to produce bad topographic mappings, e.g., too many, too
small and topographically distorted clusters. If one regards ACC as a derivative of
the Batch-SOM, improvement of topographic mapping can easily be achieved.

Maximization of the topographic term $1 - \Phi'/\alpha$ corresponds to minimization of $\Phi'$
and $\Phi$, too. This is known to produce sufficiently topography-preserving mappings
$m : X \to G$, e.g., when using Batch-SOM (Kohonen, 1995).

In contrast to that, the output density term $|N|/\sigma^2$ has some major flaws. First, the
output density term leads to maximization of output space densities, instead of
preservation. Obtained mappings are, therefore, not related to the configuration of
available clusters in the input space. Traditional ACC algorithms are not allowed to
assign two or more objects to a single grid node (see Sect. 2) in order to prevent the
mapped clusters from collapsing into a single grid node. Due to that, densities of
input data can hardly be preserved on grid $G$. In comparison with the topographic
term, the output density term is much easier to maximize and, therefore, will
distort the objective function $\phi$. Accounting for output densities is prone to distort the
formation of correct topographic mappings because it is responsible for additional
local optima of $\phi$.

The topographic term $1 - \Phi'/\alpha$ of the ACC objective function depends on the shape
of the neighbourhood function $h : G \times G \to \{0, 1\}$. Usually, the neighbourhoods’
sizes are chosen as $\sigma^2 \in \{9, 25\}$, i.e., the immediate neighbours. From the
Batch-SOM it is known that the cooling scheme of the neighbourhood radius influences
the quality of the topographic mapping very strongly (see Nybo, Venna, & Kaski,
2007 for details). A bigger radius enables a more continuous mapping in the sense
that proximities existing in the original data are visible on the grid. This is evident
because smaller neighbourhoods are more likely to exclude parts of a cluster.

In order to cope with the shortcomings mentioned above, we introduce the
Emergent Ant Colony Clustering method. An ACC method is said to be emergent if it fulfills
the following conditions:

• Ants’ modifications of mapping $m : X \to G$ are directed by maximization of
$1 - \Phi'/\alpha$ and minimization of $\Phi'$, respectively.

• Ants do not account for output densities.

• The perceptive neighbourhood of ants is not limited to immediate neighbours
on grid $G$. Instead, bigger neighbourhood radii are to be chosen in order to
obtain ESOM-like mappings.



Figure 2 illustrates the ability of the emergent ACC method to preserve even looped
input space clusters, which is hardly possible for traditional ACC.

Fig. 2 ACC projects looped cluster structures on a toroid grid. (a) Chainlink data from FCPS
(Fundamental Clustering Problem Suite, n.d.). (b) Traditional ACC with small $\alpha$ produces too
many small clusters. (c) Traditional ACC with big $\alpha$ produces fewer clusters, but no loops. (d)
Emergent ACC enables the formation of looped clusters. (e) Emergent SOM enables the formation
of looped clusters
Fig. 3 Well-known iris data (Fisher, 1936): setosa (crosses), versicolor (triangles), virginica
(squares). U-Maps shown as islands generated from toroid grids. Dark shades of gray indicate high
inter-cluster distances. (a) Too many small clusters emerge from traditional ACC. (b) Emergent
ACC preserves three clusters after the same amount of time



5 Data Analysis with Emergent Ant Colony Clustering 

Emergent ACC usually will provide an ESOM-like projection, i.e., input samples
are uniformly mapped onto the grid. See Fig. 2 for illustration. In this case, cluster
retrieval cannot be achieved according to sparse regions dividing dense clusters on
the grid.

A promising technique for cluster retrieval is based on so-called U-Maps (Ultsch
& Mörchen, 2006). Arbitrary projections from normed vector spaces onto grid
$G \subset \mathbb{N}^2$ are transformed into landscapes, so-called U-Maps. The U-Map technique
assigns each grid node a height value that represents the averaged input space
distance to its neighbouring nodes and codebook vectors, respectively. Clusters lead
to valleys on U-Maps whereas empty input space regions lead to mountains dividing
the cluster valleys. This is illustrated in Fig. 3 using Fisher’s well-known iris
data (Fisher, 1936). Traditional ACC produces too many valleys, whereas Emergent
ACC preserves cluster structures.

The U*C cluster algorithm uses the so-called watershed transformation to retrieve
cluster valleys on U-Maps. See Ultsch and Herrmann (2006) for details.



6 Experimental Settings and Results 

In order to measure the distortion of a topographic mapping method in question, a
collection of fundamental clustering problems (FCPS) is used (Fundamental
Clustering Problem Suite, n.d.). Each data set represents a certain problem that arbitrary
algorithms shall be able to handle when facing unknown real-world data. Here,
traditional and emergent ACC are tested to determine which one delivers the better
topographic mapping.

A comprehensive overview of topographic distortion measurements can be
found in Goodhill and Sejnowski (1996). Here, the so-called minimal path length
(MPL) measurement is used. It is an easy-to-compute measurement that sums
up input space distances of grid-neighbouring data objects and codebook vectors,
respectively:

$$mpl = \sum_{x \in X} \frac{1}{|N_x|} \sum_{y \in N_x} \|x - y\|. \qquad (5)$$


Lower MPL values indicate less topographic distortion when moving on the grid
and, therefore, a more trustworthy topographic mapping. Each algorithm is run
several times with the same parametrization. MPL values indicate whether accounting
for output densities assists the formation of good topographic mappings or not.
All data sets from the FCPS collection were processed with the same parameters
established in the literature, i.e., $\alpha = 0.5$, $\sigma^2 = 25$, $k_1 = 0.3$ and $k_2 = 0.1$ on a
$64 \times 64$ grid with 100 ants during 100,000 iterations. The results are illustrated
in Fig. 4. Accounting for output densities leads to increasing MPL values on
average, i.e., to worse topographic mappings. Significance has been confirmed
using a Kolmogorov-Smirnov test at the 5% level. All obtained p-values are
below 10“^. 
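The minimal path length measurement can be sketched as follows. The sketch assumes that each sample contributes the mean input space distance to the other samples mapped into its grid neighbourhood, that samples with an empty neighbourhood contribute nothing, and that the grid is toroidal with a square neighbourhood of a given radius. The function names and the neighbourhood predicate are hypothetical.

```python
import math


def toroid_neighbour(size, radius):
    """Build a predicate for square neighbourhoods on a size x size toroid grid."""
    def is_neighbour(p, q):
        return all(
            min(abs(a - b), size - abs(a - b)) <= radius for a, b in zip(p, q)
        )
    return is_neighbour


def minimal_path_length(samples, positions, is_neighbour):
    """Sum over all samples of the mean distance to their grid-neighbouring samples."""
    total = 0.0
    for a, (x, px) in enumerate(zip(samples, positions)):
        neighbours = [
            y
            for b, (y, py) in enumerate(zip(samples, positions))
            if b != a and is_neighbour(px, py)
        ]
        if neighbours:  # assumption: an empty neighbourhood contributes nothing
            total += sum(math.dist(x, y) for y in neighbours) / len(neighbours)
    return total
```

A mapping that places similar samples on nearby grid nodes yields small per-sample averages and hence a small total.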



Fig. 4 Improvement of topographic quality measured by minimal path length method: percentage
z-scores of traditional over emergent ACC for the data sets atom, chainlink, hepta, iris, target,
2diamonds and wingnut. Emergent ACC leads to improvements between 50%
and 400% when compared to traditional ACC on different FCPS data sets
7 Discussion

This work shows a previously unknown relation of two topographic mapping tech- 
niques, namely Self-Organizing Batch-Maps and Ant Colony Clustering (ACC). It 
is based on the assumption (Tan, Ting, and Teng, 2006) that stochastic agents, e.g., 
ants, are nothing more than an arbitrary sampling technique that is to be omitted 
for further analysis of formulae. This simplification is evident but may be invalid 
for stochastic agents guided by more than just randomness and topographic distor- 
tion, e.g., ants following pheromone trails. Our analysis of formulae does not cover 
popular algorithms that are not ACC derivatives following the Lumer/Faieta scheme. 

Minimal path lengths (MPL), as proposed in Sect. 6, are well-known topographic
distortion measures. The length of paths is normalized by the cardinality $|N_x|$ of the
corresponding grid neighbourhood, i.e., the number of objects mapped onto the grid
neighbourhood. This is supposed to decrease error values of locally dense mappings,
as produced by traditional ACC, because small radial neighbourhoods usually do not 
cover objects of another cluster, since locally dense mappings imply sparse dividing 
grid regions around clusters. Nevertheless, traditional ACC produces bigger MPL 
errors than emergent ACC that is not accounting for densities. We conclude that the 
topographic mapping quality is improved beyond our empirical evaluation. 

Traditional and emergent ACC methods do not converge due to the architecture 
of stochastic agents. Instead, they enable perpetual machine learning. ACC methods 
are, therefore, to be favored over traditional methods, like Self-Organizing Maps and 
hierarchical clustering, when dealing with incremental learning tasks. In contrast 
to Self-Organizing Maps, ACC methods enable the creation of topographic maps 
despite the absence of vector-space axioms, i.e., when pairwise (dis)similarity data 
is available only. 



8 Summary 

To the best of our knowledge, this is the first work that shows how the Ant Colony
Clustering (ACC) method by Lumer and Faieta (1994) is related to Self-Organizing
Maps (Kohonen, 1995). The mechanism of picking and dropping ants was omitted
in favor of a formal analysis of the underlying formulae and comparison with
Kohonen’s Batch-SOM. It could be shown that a unifying framework for both methods
does exist in terms of closely related topographic error functions. The ACC
method is to be considered a probabilistic, first-degree relative of the Batch-SOM.
The behaviour of ACC methods becomes explainable on that unifying basis.

ACC methods exhibit poor clustering abilities because of distorted topographic
mappings. Improvements of topographic mapping were derived by means of SOM
architecture. Perceptive areas are to be increased, and accounting for density of
mapped data is futile. The obtained method, Emergent ACC, no longer produces dense
clusters but uniformly distributed, SOM-like projections. Due to that,
clusters are to be retrieved using U-Map technology. As predicted by our theory,
an empirical evaluation showed on a few clustering problems that not accounting for
density of mapped data improves the quality of topographic mapping despite
unfavorable settings.
Variable Selection for Kernel Classifiers:
A Feature-to-Input Space Approach



Surette Oosthuizen and Sarel Steel



Abstract An aspect of kernel classifiers which complicates variable selection is the
implicit use of the transformation function $\Phi$. This function maps the space in which
the data cases reside, the so-called input space ($\mathcal{X}$), to a higher dimensional feature
space ($\mathcal{F}$). Variable selection in $\mathcal{F}$ is a difficult problem, while variable selection
in $\mathcal{X}$ is mostly inadequate. We propose an intermediate kernel variable selection
approach which is implemented in $\mathcal{X}$ while also accounting for the fact that kernel
classifiers operate in $\mathcal{F}$.

Keywords Binary classification · Feature space · Kernel classifiers · Variable
selection.



1 Introduction 

The context of the work in this paper is binary classification problems. Consider
therefore a response variable $Y \in \{-1, +1\}$ which is observed together with a
(large) number of variables $X_1, X_2, \ldots, X_p$ for $n = n_1 + n_2$ sample cases. The
$n_1$ cases with indices in $I_1$ belong to population 1 (where we have $Y = +1$). The
remaining $n_2$ cases have indices in $I_2$ and belong to population 2. Thus the training
data set is $T = \{(x_i, y_i),\ i = 1, \ldots, n\}$, where $x_i$ is a $p$-component vector containing
the values of $X_1, X_2, \ldots, X_p$ for case $i$ in the sample. The space to which $x$
belongs is called the input space, denoted by $\mathcal{X}$. In classification the objective is to
use the training data to determine a discriminant function, $f(x)$, so that $\mathrm{sign}\{f(x)\}$
can be used to assign a new case with observed values of the classification variables
in the vector $x$ to one of the two populations.

Many procedures have been proposed for dealing with binary classification problems.
A class of methods which has been shown to perform particularly well in most
data scenarios is the class of kernel classifiers. Perhaps the most popular kernel
classifiers are the support vector machine (SVM), kernel logistic regression (KLR),
and kernel Fisher discriminant analysis (KFDA). An advantage of kernel classifiers
is their good performance in applications where $p$ is (much) larger than $n$, e.g., in
image and genomic data analyses.

For an introduction to kernel classifiers the reader is referred to Schölkopf and
Smola (2002) and Shawe-Taylor and Cristianini (2004). For the purposes of this
paper, a relatively simple explanation of the basic principles underlying an SVM
will suffice - cf. Bierman and Steel (2009) in this regard.

In this paper we consider the problem of variable selection for kernel classifiers, 
a topic which is currently receiving much attention in the literature. Some of the 
more recent papers include Claeskens, Croux, and Van Kerkhoven (2008), Keerthi 
(2005) and Zhang (2006). 

The remainder of the paper is organised as follows. In Sect. 2 we present our pro- 
posal for a feature-to-input space approach to kernel variable selection. A numerical 
evaluation of the proposed selection procedure is outlined in Sect. 3, followed by 
results and a brief discussion in Sect. 4. Experiments on benchmark data sets are 
reported in Sect. 5. We conclude the paper in Sect. 6. 



2 Feature-to-Input Space Variable Selection

Since the function $\Phi$ plays an integral role in non-linear kernel classifiers, it seems
reasonable that a kernel variable selection procedure should incorporate this aspect.
On the other hand, in terms of simplicity and interpretability, it would be convenient
for variable selection to associate a single coefficient with each variable, and
base selection on the absolute sizes of these variable coefficients, as is for example
frequently done in a multiple linear regression model. In this section we propose
and evaluate variable selection approaches which are implemented in $\mathcal{X}$ but also take
into account the fact that kernel methods operate in $\mathcal{F}$. This so-called feature-to-input
space (FI) approach may thus be viewed as an intermediate approach bridging
the gap between naive selection approaches in $\mathcal{X}$, and more sophisticated selection
procedures performed entirely in $\mathcal{F}$.



2.1 FI-Selection Based on the Group Means

Variable selection in $\mathcal{F}$ is considerably more difficult than selection in $\mathcal{X}$ - mainly
because calculations in $\mathcal{F}$ are restricted to evaluation of inner products between
elements of $\mathcal{F}$. In kernel classification evaluation of inner products is sufficient and
there is no need for a closer examination of any other quantities in $\mathcal{F}$. Variable
selection, however, seems to be an example of a problem where it would be useful
to examine certain vectors in $\mathcal{F}$ more closely. This is often a formidable problem,
especially since $\mathcal{F}$ may be infinite dimensional. The concept of the pre-image in $\mathcal{X}$
of a vector in $\mathcal{F}$ seems to be worth investigating in this regard.
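
To make the restriction to inner products concrete, the following minimal sketch
computes the squared distance between the images of two input vectors in the feature
space using kernel evaluations only, via $\lVert \Phi(x) - \Phi(z) \rVert^2 = k(x, x) - 2k(x, z) + k(z, z)$.
It assumes a Gaussian kernel parameterised as $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$; the
function names and the example values are illustrative and are not taken from the paper.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) (assumed parameterisation)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def feature_space_sq_distance(x, z, gamma):
    """Squared distance between Phi(x) and Phi(z), using inner products only."""
    return (gaussian_kernel(x, x, gamma)
            - 2.0 * gaussian_kernel(x, z, gamma)
            + gaussian_kernel(z, z, gamma))


# Illustrative input vectors; Phi itself is never evaluated.
d2 = feature_space_sq_distance((0.0, 0.0), (1.0, 1.0), gamma=0.5)
print(d2)
```

The mapping $\Phi$ never appears explicitly, which is exactly why individual vectors in the
feature space are hard to inspect directly.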
Consider therefore a linear expansion $\xi = \xi(\alpha) = \sum_{i=1}^{n} \alpha_i \psi_i$ in $\mathcal{F}$, where
$\psi_i = \Phi(x_i)$, $i = 1, 2, \ldots, n$, with vectors $x_1, x_2, \ldots, x_n$ belonging to $\mathcal{X}$. The
pre-image of $\xi$ is a vector $z = z(\xi) \in \mathcal{X}$ such that $\Phi(z) = \xi$. In Schölkopf and
Smola (2002, p. 544), a useful result is provided for calculating this pre-image if it
exists. Unfortunately, in many cases a pre-image for $\xi$ does not exist. In fact, for
the Gaussian kernel which we focus on, Schölkopf and Smola (2002, pp. 545-546)
argue that the pre-image of $\xi$ will only exist in trivial cases consisting of a single
term. We are therefore forced into considering the somewhat less satisfactory option
of finding an approximate pre-image for a given linear combination $\xi$. The vector
$\tilde{z} \in \mathcal{X}$ is called an approximate pre-image of $\xi$ if

$$\rho(\tilde{z}) = \lVert \xi - \Phi(\tilde{z}) \rVert^2 \tag{1}$$

is small. As pointed out by Schölkopf and Smola (2002, p. 546), the meaning of
small in this definition will depend on the particular application. In order to calculate
an approximate pre-image, we follow an approach discussed by Schölkopf and
Smola (2002, pp. 547-548).

How can the concept of (approximate) pre-images be utilised for variable selection?
We investigated various possibilities empirically and found the following two
options to be most promising. Since our focus is on binary classification, it seems
a sensible idea to select those variables maximising some measure of the difference
between the groups of data points corresponding to the two populations. Consider
in this regard therefore the respective group means in $\mathcal{F}$, viz.

$$\xi_1 = \frac{1}{n_1} \sum_{i \in I_1} \psi_i \quad \text{and} \quad \xi_2 = \frac{1}{n_2} \sum_{i \in I_2} \psi_i, \tag{2}$$

and their approximate pre-images, viz. $\tilde{z}(\xi_1)$ and $\tilde{z}(\xi_2)$. Our first proposal for variable
selection based on pre-images is to select the variables corresponding to the
largest absolute components of the difference vector $\delta = \tilde{z}(\xi_1) - \tilde{z}(\xi_2)$. This proposal
is based on the assumption that variables which maximally separate the two
groups in terms of the pre-images in $\mathcal{X}$ of their mean vectors in $\mathcal{F}$, will also be the
appropriate variables for separating the groups in $\mathcal{F}$. Note that this proposal is independent
of the specific kernel classifier being used. This variable selection proposal
is investigated further in the simulation study, and we refer to it as a feature-to-input
space proposal based on mean vectors, abbreviated to FI(M).



2.2 FI-Selection Based on the Kernel Weight Vector

It is also possible to propose a variable selection criterion based on pre-images and
using information derived from the specific kernel classifier. Consider in this regard
the kernel classifier weight vector $w = \sum_{i=1}^{n} \alpha_i \psi_i$, where the scalars $\alpha_1, \alpha_2, \ldots, \alpha_n$
are determined from the training data according to some kernel algorithm. We can
write this weight vector as

$$w = \sum_{i \in I_1} \alpha_i \psi_i + \sum_{i \in I_2} \alpha_i \psi_i, \tag{3}$$

where $w_1$ and $w_2$ denote the parts of $w$ contributed by the cases in $I_1$ and $I_2$
respectively. In line with our first proposal we now consider the vector
$\delta = \tilde{z}(w_1) - \tilde{z}(w_2)$ and
propose to select the variables corresponding to the largest absolute components of
this vector. Here, $\tilde{z}(w_1)$ and $\tilde{z}(w_2)$ are once again the approximate pre-images of $w_1$
and $w_2$ respectively. This proposal, called feature-to-input space variable selection
based on the weight vector and abbreviated to FI(W), is also investigated more
thoroughly in the simulation study.
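
The two proposals can be sketched in a few lines of Python. The sketch below is not the
authors' implementation: it assumes a Gaussian kernel $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ and
uses the fixed-point iteration commonly applied to Gaussian kernels to obtain an
approximate pre-image, started (as a further assumption) at the ordinary input-space
mean. The helper names `approx_preimage` and `fi_select` and the toy data are hypothetical.
With coefficients $1/n_1$ and $1/n_2$ the function corresponds to an FI(M)-type selection;
passing the classifier coefficients of each group instead gives an FI(W)-type selection.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) (assumed parameterisation)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def approx_preimage(points, coeffs, gamma, n_iter=100, tol=1e-10):
    """Fixed-point iteration for an approximate pre-image of
    sum_i coeffs[i] * Phi(points[i]) under a Gaussian kernel."""
    p = len(points[0])
    # Assumption: start from the input-space mean of the points.
    z = [sum(x[j] for x in points) / len(points) for j in range(p)]
    for _ in range(n_iter):
        weights = [c * gaussian_kernel(x, z, gamma) for c, x in zip(coeffs, points)]
        total = sum(weights)
        if abs(total) < 1e-300:
            break  # update undefined; keep the current iterate
        z_new = [sum(w * x[j] for w, x in zip(weights, points)) / total
                 for j in range(p)]
        shift = max(abs(a - b) for a, b in zip(z_new, z))
        z = z_new
        if shift < tol:
            break
    return z


def fi_select(group1, coeffs1, group2, coeffs2, gamma, m):
    """Select the m variables with the largest absolute components of the
    difference between the two approximate pre-images."""
    z1 = approx_preimage(group1, coeffs1, gamma)
    z2 = approx_preimage(group2, coeffs2, gamma)
    delta = [a - b for a, b in zip(z1, z2)]
    ranked = sorted(range(len(delta)), key=lambda j: abs(delta[j]), reverse=True)
    return sorted(ranked[:m]), delta


# Toy data (hypothetical): the groups differ in the first variable only.
group1 = [(0.0, 0.1, -0.1), (0.2, -0.1, 0.1), (-0.2, 0.0, 0.0)]
group2 = [(3.0, 0.1, 0.0), (3.2, -0.1, 0.1), (2.8, 0.0, -0.1)]
mean_coeffs1 = [1.0 / len(group1)] * len(group1)
mean_coeffs2 = [1.0 / len(group2)] * len(group2)
selected, delta = fi_select(group1, mean_coeffs1, group2, mean_coeffs2,
                            gamma=0.5, m=1)
print(selected, delta)
```

Because each update is a weighted average of the data cases, the approximate pre-image
of a group mean stays inside the range of that group, so the difference vector is
directly interpretable on the scale of the original variables.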



3 Simulation Study 

In order to evaluate and compare the performances of the above two selection 
approaches, we conducted a fairly extensive simulation experiment. Naturally there 
are many factors influencing the post-selection properties of kernel classifiers, mak- 
ing it virtually impossible to conduct an exhaustive investigation. In our simulation 
studies we attempted to investigate the influence of the most important of these 
factors, which we now proceed to describe: 

1. The underlying distribution from which the variables arise. Two cases were
considered: a multivariate normal, and a multivariate lognormal distribution.
These distributions were selected as representative of symmetric and asymmetric
distributions respectively.

2. The manner in which the two groups differ. Firstly, we looked at situations where
the groups differ with respect to location, and then only for a subset of $m$ of the
available variables (see 5. below). We refer to this subset of variables as the set
of relevant variables, contained in $V_S$, with $\mathrm{card}(V_S) = m$. The set of variables
which do not contribute to differences between the two groups will be denoted by
$V_{\bar{S}}$. Secondly, situations were investigated where the two groups differ only with
respect to spread (variance-covariance structure) - once again only for variables
contained in a subset $V_S$ of the set of available variables.

3. The dimension of the problem, i.e., the total number $p$ of available variables,
which was 10 in cases where $p < n$ and 60 whenever $p > n$.

4. We varied the training sample size, investigating three cases: $n_1 = n_2 = 15$,
$n_1 = n_2 = 100$, and $n_1 = 25$; $n_2 = 75$. Combining these different sample sizes
with the different values of $p$ specified in 3. gives us the following cases which
were investigated: $n_1 = n_2 = 15$ and $p = 10$ (referred to as small samples in the
further discussion); $n_1 = 25$; $n_2 = 75$ and $p = 10$ (referred to as mixed samples
in the further discussion), and $n_1 = n_2 = 100$ combined with $p = 10$ (referred
to as large samples in the further discussion). In addition we also investigated
the scenario $n_1 = n_2 = 15$ and $p = 60$, referred to as wide samples in the
further discussion, since the standard "data case" by "variables" input matrix has
a "wide" appearance for such data sets.

5. The fraction of relevant input variables contributing towards separation between
the two groups, i.e., $\pi = m/p$. We used two fractions: 0.1 (thus, 1 out of 10, and
6 out of 60), and 0.4 (thus, 4 out of 10, and 24 out of 60).

6. Finally, we varied the dependence amongst the input variables as reflected in
their correlation coefficients. Consider two distinct input variables, $X_j$ and $X_k$.
If $X_j, X_k \in V_S$, we denote the correlation between $X_j$ and $X_k$ by $\rho_S$, and
we assume that all such pairs exhibit this correlation. If $X_j, X_k \in V_{\bar{S}}$, we
use the symbol $\rho_{\bar{S}}$ for the common correlation between all such pairs, and if
$X_j \in V_S$, $X_k \in V_{\bar{S}}$, the common correlation coefficient is denoted by $\rho_{S\bar{S}}$.
For normal input data we consistently assumed that all irrelevant variables were
independent, i.e., we used $\rho_{\bar{S}} = 0$ throughout. For the relevant variables we
used $\rho_S = 0$ and $\rho_S = 0.7$, combining this with two values for the correlation
between relevant and irrelevant variables, viz. $\rho_{S\bar{S}} = 0$ and $\rho_{S\bar{S}} = 0.9$.
For lognormal input data we also fixed $\rho_{\bar{S}} = 0$ and investigated the following
cases: for $\pi = 0.1$, we combined each of $\rho_S = 0$ and $\rho_S = 0.7$ with each of
$\rho_{S\bar{S}} = 0$ and $\rho_{S\bar{S}} = 0.2$; for $\pi = 0.4$, we combined $\rho_S = 0$ with $\rho_{S\bar{S}} = 0$,
and $\rho_S = 0.7$ with $\rho_{S\bar{S}} = 0.25$. See Oosthuizen (2008, Appendix A, Sect. 1) for
a detailed motivation.
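
As an illustration of how such a configuration might be generated, the sketch below
draws one group of normal data in which the first $m$ variables are the relevant ones,
share a common correlation $\rho_S$, and are shifted in location, while the remaining
variables are independent standard normal (so the correlations among irrelevant
variables and between relevant and irrelevant variables are both zero). The size of the
location shift is not stated above, so the value used here is purely hypothetical, as is
the function name `generate_group`. The equicorrelation is obtained through a single
shared factor, which is one of several possible constructions.

```python
import math
import random


def generate_group(n, p, m, shift, rho_s, rng):
    """Draw n cases on p variables; the first m variables are the relevant ones.

    Relevant variables have unit variance, common correlation rho_s and mean
    `shift`; irrelevant variables are independent standard normal.
    """
    cases = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)  # shared factor inducing correlation rho_s
        row = []
        for j in range(p):
            noise = rng.gauss(0.0, 1.0)
            if j < m:
                value = (math.sqrt(rho_s) * common
                         + math.sqrt(1.0 - rho_s) * noise + shift)
            else:
                value = noise
            row.append(value)
        cases.append(row)
    return cases


# A "large samples" style configuration: n1 = n2 = 100, p = 10, pi = 0.4.
rng = random.Random(2008)
group1 = generate_group(n=100, p=10, m=4, shift=1.0, rho_s=0.7, rng=rng)  # shift is hypothetical
group2 = generate_group(n=100, p=10, m=4, shift=0.0, rho_s=0.7, rng=rng)
print(len(group1), len(group1[0]))
```

A spread-difference configuration could be obtained in the same spirit by rescaling the
relevant variables of one group instead of shifting them.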

We require notation to describe the cases arising from combination of the fac- 
tors described above at their different levels. Firstly, we write NL to denote cases 
where the training data were generated from two normal distributions differing with 
respect to location, NS for the cases where the two normal distributions differed 
with respect to spread, and LL and LS for the corresponding lognormal distribution 
cases. 

For each of the configurations described in the previous section we performed
1,000 Monte Carlo repetitions, each repetition entailing generation of a training and
test data set according to the specifications of the case. The number of cases in the
test data set was 2,000 throughout, with the percentages corresponding to the two
groups the same as in the training data. Now consider any given relevant selection
procedure. This procedure was applied to the training data, therefore obtaining a
selected subset of $m$ variables. After 1,000 Monte Carlo repetitions we therefore
ended up with 1,000 (potentially) different sets of input variables identified by this
particular selection technique. For each selection technique we are interested in the
generalisation error if the particular selection technique is applied. Thus for each
selection technique we classified the test data cases using the relevant kernel classifier
based only on the input variables selected from the corresponding training data
set. The misclassification error rate was calculated as the proportion of incorrectly
classified test cases, and we used the average over the 1,000 Monte Carlo repetitions
of these misclassification error rates as an estimate of the generalisation error associated
with the given selection proposal. For a more detailed description of the way in
which the NL, NS, LL and LS data sets were generated, cf. Oosthuizen (2008).
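
The evaluation loop described above can be summarised by the following sketch. It is a
scaled-down illustration under several assumptions: a nearest-mean classifier stands in
for the kernel classifier, a simple mean-difference rule stands in for the selection
procedure, and the number of repetitions, the sample sizes and the data-generating
settings are hypothetical. Only the structure (generate, select on the training data,
classify the test data on the selected variables, average the error rates) follows the text.

```python
import random


def misclassification_rate(y_true, y_pred):
    """Proportion of incorrectly classified cases."""
    return sum(1 for a, b in zip(y_true, y_pred) if a != b) / len(y_true)


def monte_carlo_error(generate, select, fit, n_rep, rng):
    """Average test error over n_rep repetitions of generate/select/fit."""
    rates = []
    for _ in range(n_rep):
        x_train, y_train = generate(rng, test=False)
        x_test, y_test = generate(rng, test=True)
        keep = select(x_train, y_train)  # selection uses training data only

        def restrict(rows):
            return [[row[j] for j in keep] for row in rows]

        predict = fit(restrict(x_train), y_train)
        rates.append(misclassification_rate(y_test, predict(restrict(x_test))))
    return sum(rates) / n_rep


def generate(rng, test):
    """Stand-in generator: variable 0 is relevant, variable 1 is noise."""
    n_per_group = 200 if test else 15
    x, y = [], []
    for label, shift in ((+1, 4.0), (-1, 0.0)):
        for _ in range(n_per_group):
            x.append([rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)])
            y.append(label)
    return x, y


def select_by_mean_difference(x, y, m=1):
    """Stand-in selection rule: largest absolute difference in group means."""
    def group_mean(j, label):
        vals = [row[j] for row, lab in zip(x, y) if lab == label]
        return sum(vals) / len(vals)

    p = len(x[0])
    diffs = [abs(group_mean(j, +1) - group_mean(j, -1)) for j in range(p)]
    return sorted(sorted(range(p), key=lambda j: diffs[j], reverse=True)[:m])


def fit_nearest_mean(x, y):
    """Stand-in classifier: assign each case to the nearest group mean."""
    p = len(x[0])
    means = {}
    for label in (+1, -1):
        rows = [row for row, lab in zip(x, y) if lab == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]

    def predict(rows):
        out = []
        for row in rows:
            dist = {lab: sum((a - b) ** 2 for a, b in zip(row, mu))
                    for lab, mu in means.items()}
            out.append(min(dist, key=dist.get))
        return out

    return predict


estimate = monte_carlo_error(generate, select_by_mean_difference,
                             fit_nearest_mean, n_rep=20, rng=random.Random(1))
print(estimate)
```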

The performances of FI(M) and FI(W) were compared with those of two baseline
selection criteria/methods: the alignment (cf. Cristianini et al., 2002), and RFE
based on the zero-order norm of the SVM weight vector (cf. Claeskens et al., 2008).
Since both the alignment selection criterion, denoted F(A), and the norm of the
weight vector, denoted F(N), are calculated in $\mathcal{F}$, note that they make use of more
information than the FI-criteria, and are computationally more expensive.
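
For readers unfamiliar with the alignment, the sketch below computes the empirical
kernel-target alignment $\langle K, yy' \rangle_F / (\lVert K \rVert_F \lVert yy' \rVert_F)$ for a Gaussian Gram
matrix built on a single variable. How the alignment is embedded in the F(A) selection
procedure is not described here, so the per-variable comparison shown is only an
illustrative assumption, and the data and function names are hypothetical.

```python
import math


def gaussian_gram(cases, gamma):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x - z||^2)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))
             for z in cases] for x in cases]


def alignment(gram, y):
    """Empirical alignment between a Gram matrix and labels in {-1, +1}."""
    n = len(y)
    numerator = sum(gram[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    gram_norm = math.sqrt(sum(gram[i][j] ** 2 for i in range(n) for j in range(n)))
    return numerator / (gram_norm * n)  # ||yy'||_F equals n for +/-1 labels


# Toy data (hypothetical): variable 0 separates the groups, variable 1 does not.
cases = [(0.0, 0.0), (0.1, 3.0), (3.0, 0.1), (3.1, 3.1)]
labels = [+1, +1, -1, -1]

per_variable = []
for j in range(2):
    column = [(x[j],) for x in cases]
    per_variable.append(alignment(gaussian_gram(column, gamma=1.0), labels))
print(per_variable)
```

The separating variable receives a much larger alignment than the uninformative one,
which is the property a selection criterion of this type relies on.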



4 Simulation Results and Conclusions 

In this section we present and briefly discuss the results of the simulation study
described above. For easier comparison of the procedures, we calculated the size of
the average test error obtained for a procedure $X$ relative to that of the SVM based
only on the relevant subset of variables, i.e., $\mathrm{Err}(X)/\mathrm{Err}(V_S)$.

Since variable selection was shown not to be crucial in the NL setups, note that
we restrict attention to the NS, LL and LS data sets. Moreover, since our conclusions
for the cases $\rho_{S\bar{S}} = 0$ and $\rho_{S\bar{S}} > 0$ were very similar, we only report the
$\rho_{S\bar{S}} = 0$ results for the NS, LL and LS data in Tables 1-3. (The remainder of the
results can be found in Oosthuizen, 2008.)

In Tables 1-3, the performances of F(A) and F(N) definitely stand out. Only in
the mixed NS and LS data cases do FI(M) and FI(W) outperform F(A), but not
F(N). In the wide LL data sets, FI(M) and FI(W) outperform F(N). In the NS
and LS data, FI(W) mostly performs better than FI(M), whereas in the LL data
cases, FI(W) and FI(M) perform more similarly.



Table 1 Selection values for NS data

Samples  Case   $\rho_S$  $\rho_{S\bar{S}}$  $\pi$   F(A)   F(N)   FI(M)   FI(W)
-------  ----   --------  -----------------  -----   ----   ----   -----   -----
Small    NS1    0.0       0.0                0.1     1.19   1.30    1.64    1.63
         NS2    0.0       0.0                0.4     1.44   1.27    2.82    2.58
         NS3    0.7       0.0                0.1     1.18   1.31    1.66    1.65
         NS4    0.7       0.0                0.4     1.23   1.39    1.89    1.84
Mixed    NS5    0.0       0.0                0.1     1.06   1.03    1.04    1.04
         NS6    0.0       0.0                0.4     3.98   1.04    3.34    2.66
         NS7    0.7       0.0                0.1     1.06   1.04    1.04    1.04
         NS8    0.7       0.0                0.4     3.35   1.32    2.81    2.28
Large    NS9    0.0       0.0                0.1     1.00   1.02    1.83    1.82
         NS10   0.0       0.0                0.4     1.00   1.00    3.11    2.34
         NS11   0.7       0.0                0.1     1.00   1.01    1.84    1.80
         NS12   0.7       0.0                0.4     1.00   1.00    2.88    2.18
Wide     NS13   0.0       0.0                0.1     2.02   2.07    3.73    3.74
         NS14   0.0       0.0                0.4     7.67   3.17   48.33   45.67
         NS15   0.7       0.0                0.1     1.50   2.03    2.38    2.44
         NS16   0.7       0.0                0.4     1.31   1.34    2.19    2.38

Table 2 Selection values for LL data

Samples  Case   $\rho_S$  $\rho_{S\bar{S}}$  $\pi$   F(A)   F(N)   FI(M)   FI(W)
-------  ----   --------  -----------------  -----   ----   ----   -----   -----
Small    LL1    0.0       0.0                0.1     1.05   1.24    1.42    1.33
         LL2    0.0       0.0                0.4     1.03   1.19    1.15    1.15
         LL3    0.7       0.0                0.1     1.06   1.28    1.32    1.32
         LL4    0.7       0.0                0.4     1.02   1.08    1.09    1.10
Mixed    LL5    0.0       0.0                0.1     1.00   1.00    1.05    1.05
         LL6    0.0       0.0                0.4     1.02   1.19    1.02    1.07
         LL7    0.7       0.0                0.1     1.02   1.00    1.05    1.03
         LL8    0.7       0.0                0.4     1.00   1.00    1.00    1.00
Large    LL9    0.0       0.0                0.1     1.00   1.00    1.03    1.02
         LL10   0.0       0.0                0.4     1.00   1.03    1.06    1.06
         LL11   0.7       0.0                0.1     1.00   1.00    1.04    1.03
         LL12   0.7       0.0                0.4     1.00   1.07    1.02    1.30
Wide     LL13   0.0       0.0                0.1     1.10   1.85    1.56    1.67
         LL14   0.0       0.0                0.4     1.10   2.00    1.48    1.48
         LL15   0.7       0.0                0.1     1.07   1.37    1.36    1.36
         LL16   0.7       0.0                0.4     1.04   1.14    1.13    1.06



Table 3 Selection values for LS data

Samples  Case   $\rho_S$  $\rho_{S\bar{S}}$  $\pi$   F(A)   F(N)   FI(M)   FI(W)
-------  ----   --------  -----------------  -----   ----   ----   -----   -----
Small    LS1    0.0       0.0                0.1     1.24   1.31    1.97    1.92
         LS2    0.0       0.0                0.4     1.23   1.34    1.88    1.74
         LS3    0.7       0.0                0.1     1.26   1.35    1.98    1.92
         LS4    0.7       0.0                0.4     1.17   1.40    1.55    1.65
Mixed    LS5    0.0       0.0                0.1     1.21   1.09    1.17    1.13
         LS6    0.0       0.0                0.4     3.58   1.19    2.28    1.84
         LS7    0.7       0.0                0.1     1.21   1.09    1.17    1.14
         LS8    0.7       0.0                0.4     3.19   1.42    1.65    2.04
Large    LS9    0.0       0.0                0.1     1.01   1.01    1.66    1.12
         LS10   0.0       0.0                0.4     0.95   1.00    1.33    1.07
         LS11   0.7       0.0                0.1     1.01   1.02    1.65    1.12
         LS12   0.7       0.0                0.4     1.00   1.08    1.12    1.52
Wide     LS13   0.0       0.0                0.1     1.51   1.95    2.98    2.95
         LS14   0.0       0.0                0.4     1.79   2.11    4.39    4.30
         LS15   0.7       0.0                0.1     1.42   1.91    2.34    2.42
         LS16   0.7       0.0                0.4     1.41   1.80    1.97    2.14



5 Application to Data Sets 



We further investigated the merit of FI(M) and FI(W) on a sample of real-world
data sets from the UCI Machine Learning Repository. These were the BUPA Liver
Disorder data, the Johns Hopkins University Ionosphere data, the Pima Indians
Diabetes data, the Sonar (mines vs. rocks) data, and the Wisconsin Breast Cancer data.
Table 4 Application to data sets: average test errors and standard errors

                Cancer         Diabetes       Ionosphere     Liver          Sonar
                ------         --------       ----------     -----          -----
n               683            768            351            345            208
p               9              8              33             6              60
m               3              4              20             5              10
F(C)            0.042(0.010)   0.227(0.010)   0.104(0.010)   0.280(0.010)   0.214(0.010)
($\gamma$; C)   (0.2; 2)       (0.125; 50)    (0.2; 2)       (0.01; 45)     (0.2; 10)
FI(M)           0.032(0.003)   0.232(0.006)   0.085(0.005)   0.285(0.010)   0.081(0.011)
($\gamma$; C)   (0.2; 50)      (0.125; 50)    (0.1; 2)       (0.2; 15)      (0.2; 10)
FI(W)           0.039(0.004)   0.222(0.004)   0.054(0.007)   0.297(0.015)   0.081(0.011)


The number of cases ($n$) and the number of variables ($p$) in each data set are given
in Table 4. We compared the performances of FI(M) and FI(W) with the proposal
by Zhang (2006), denoted in this section by F(C). Following Zhang (2006),
for each data set we randomly selected two thirds of the data for training, and the
remaining third for testing. During the randomisation the relationship between the
group sizes in the resulting training and test sets was kept the same as in the original
data set. Similar to the evaluation in Zhang (2006), we repeated the randomisation
ten times, and calculated an average error on the test data. During each of the ten
randomisations, various hyperparameter specifications were considered. In the case
of, for example, the Liver and Diabetes data, we evaluated our technique using kernel
hyperparameter values $\gamma = 1/p$, 0.1 and 0.2. For the Cancer data, we used
$\gamma = 1/p$, 0.01 and 0.2. Moreover, for the Liver data we used cost parameter values
$C = 15$, 20, 25, 30, 35 and 45, while for the remaining data sets we used $C = 1$, 1.5,
2, 10, 50 and 100. For each selection method the SVM kernel and cost parameter
values at which minimum average test errors were obtained are provided in Table 4.
Regarding the number of variables to select, we experimented with various numbers
and report the best of these as appropriate values for $m$ in Table 4.
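
The resampling scheme can be sketched as follows. The code is an illustration only:
the label vector is hypothetical, the function name `stratified_split` is not from the
paper, and the grid shown is the one quoted above for the Diabetes data ($p = 8$, so
$\gamma = 1/p = 0.125$). Each split keeps the group proportions of the full data set in both
the training and the test part.

```python
import random


def stratified_split(labels, train_fraction, rng):
    """Random split of case indices that preserves the group proportions."""
    train, test = [], []
    for label in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        rng.shuffle(idx)
        n_train = round(train_fraction * len(idx))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)


# Hypothetical labels: 60 cases in one group, 30 in the other.
labels = [+1] * 60 + [-1] * 30
rng = random.Random(0)
splits = [stratified_split(labels, 2 / 3, rng) for _ in range(10)]  # ten randomisations

# Hyperparameter grid as quoted in the text for the Diabetes data (p = 8).
p = 8
grid = [(gamma, cost)
        for gamma in (1 / p, 0.1, 0.2)
        for cost in (1, 1.5, 2, 10, 50, 100)]
print(len(splits), len(grid))
```

In a full experiment, each grid point would be evaluated on every split and the average
test error over the ten splits recorded per grid point.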

From the table it can be seen that for the Cancer, Ionosphere and Sonar data,
FI(M) and FI(W) were far superior to F(C). For the Diabetes data, only FI(W)
outperformed F(C). There was no clear winner between FI(M) and FI(W): for
the Cancer and Liver data FI(M) performed better, while for the Diabetes and
Ionosphere data the opposite was true. In the Sonar data case, the two selection
techniques achieved equal test error rates.

We further summarise the performances of FI(M) and FI(W) in Table 5 below.
In the NO rows we report the test errors obtained when no selection was done.
The Msel and Wsel rows report selection values $\mathrm{Err}(FI(M))/\mathrm{Err}(NO)$ and
$\mathrm{Err}(FI(W))/\mathrm{Err}(NO)$ respectively.

From Table 5 the benefit from using either FI(M) or FI(W) is clearly evident,
especially so in the case of the Cancer and Ionosphere data. Comparing the Msel
and Wsel rows, FI(M) seems to be the selection criterion of choice.
Table 5 Application to data sets: selection values

                Cancer         Diabetes       Ionosphere     Liver          Sonar
                ------         --------       ----------     -----          -----
($\gamma$; C)   (0.01; 100)    (0.125; 100)   (0.2; 2)       (0.2; 25)      (0.2; 10)
NO              0.139(0.012)   0.319(0.011)   0.205(0.037)   0.337(0.013)   0.111(0.010)
FI(M)           0.042(0.005)   0.256(0.006)   0.085(0.005)   0.339(0.018)   0.081(0.011)
Msel            0.30           0.80           0.41           1.01           0.73
($\gamma$; C)   (0.01; 100)    (0.2; 10)      (0.2; 1)       (0.2; 25)      (0.2; 10)
NO              0.139(0.012)   0.285(0.008)   0.108(0.034)   0.337(0.013)   0.111(0.010)
FI(W)           0.053(0.005)   0.239(0.009)   0.067(0.006)   0.316(0.015)   0.081(0.011)
Wsel            0.38           0.84           0.62           0.94           0.73
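
The Msel and Wsel rows of Table 5 are simple ratios of the reported average test errors,
and the short check below recomputes them from the FI(M), FI(W) and NO rows. The
variable names are illustrative; the numbers are those reported in the table, in the
column order Cancer, Diabetes, Ionosphere, Liver, Sonar.

```python
def selection_values(err_selected, err_reference):
    """Ratio of test errors with selection to test errors without, to 2 decimals."""
    return [round(a / b, 2) for a, b in zip(err_selected, err_reference)]


# Average test errors from Table 5 (Cancer, Diabetes, Ionosphere, Liver, Sonar).
no_m = [0.139, 0.319, 0.205, 0.337, 0.111]
fi_m = [0.042, 0.256, 0.085, 0.339, 0.081]
no_w = [0.139, 0.285, 0.108, 0.337, 0.111]
fi_w = [0.053, 0.239, 0.067, 0.316, 0.081]

msel = selection_values(fi_m, no_m)
wsel = selection_values(fi_w, no_w)
print(msel)  # [0.3, 0.8, 0.41, 1.01, 0.73]
print(wsel)  # [0.38, 0.84, 0.62, 0.94, 0.73]
```

Values below 1 indicate that selection reduced the test error relative to using all
variables; only FI(M) on the Liver data exceeds 1.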



6 Summary 

In this paper we proposed a new approach for kernel variable selection: so-called
feature-to-input space selection. The basic idea underlying this approach is to
combine the information obtained from feature space computations with the easy
interpretation in input space. Two strategies arose: the first was based on the two
group means in $\mathcal{F}$, and the second strategy required calculation of pre-images of the
kernel weight vector in feature space.

The results of an empirical study investigating the properties of the new selection
procedures were reported and discussed. The FI(W) procedure mostly outperformed
FI(M), with the exception of LL data sets. As expected, F(A) and F(N),
being the criteria computed entirely in feature space, performed best overall.

In the real-world data experiments FI(M) and FI(W) mostly outperformed the
selection method of Zhang (2006). Based on test error rates it was hard to decide
which of the selection methods yielded the better performance across all five data
sets. The selection values indicated FI(M) as the overall winner.

We conclude that the new FI approach is promising and deserving of further
exploration.
Finite Mixture and Genetic Algorithm
Segmentation in Partial Least Squares Path 
Modeling: Identification of Multiple Segments 
in Complex Path Models 



Christian M. Ringle, Marko Sarstedt, and Rainer Schlittgen 



Abstract When applying structural equation modeling methods, such as partial 
least squares (PLS) path modeling, in empirical studies, the assumption that the 
data have been collected from a single homogeneous population is often unrealis- 
tic. Unobserved heterogeneity in the PLS estimates on the aggregate data level may 
result in misleading interpretations. Finite mixture partial least squares (FIMIX- 
PLS) and PLS genetic algorithm segmentation (PLS-GAS) allow the classification 
of data in variance-based structural equation modeling. This research presents 
an initial application and comparison of these two methods in a computational 
experiment in respect of a path model which includes multiple endogenous latent 
variables. The results of this analysis reveal particular advantages and disadvantages 
of the approaches. This study further substantiates the effectiveness of FIMIX-PLS 
and PLS-GAS and provides researchers and practitioners with additional informa- 
tion they need to proficiently evaluate their PLS path modeling results by applying 
a systematic means of analysis. If significant heterogeneity were to be uncovered by 
the procedures, the analysis may result in group-specific path modeling outcomes, 
thus allowing further differentiated and more precise conclusions to be formed. 

Keywords Finite mixture • Genetic algorithm • Heterogeneity • PLS path modeling •
Segmentation

1 Introduction 

Applications of structural equation models (SEMs) are usually based on the assumption
that the analyzed data stem from a single population (i.e., a unique global model
represents all the observations well). However, in many real-world applications,
this assumption of homogeneity is unrealistic, as individuals are likely to be
heterogeneous in their perceptions and evaluations of latent constructs. Traditionally,
heterogeneity in SEMs is taken into account by assuming that observations
can be assigned to segments a priori, on the basis of, for example, geographic
variables or stated preferences. However, postulating homogeneous segments based
on such prior knowledge usually provides unsatisfactory outcomes as observable
characteristics often gloss over heterogeneity (Wedel & Kamakura, 2000).

Owing to the increasing dissemination of PLS (Wold, 1982; Lohmöller, 1989),
research interest has recently been devoted to the question of clustering in
PLS path modeling. Not only have newly developed multi-group comparison procedures
for comparing segments derived from a-priori information or clustering
methods been proposed lately, but also different approaches to detect latent classes,
which generalize, for example, finite mixture, fuzzy regression, or genetic algorithm
approaches to PLS. Sarstedt (2008) has reviewed the latent class approaches
to segmentation in PLS path modeling, such as the decision-tree-based PATHMOX
(Sanchez & Aluja, 2007) method, the distance-measure-based PLS-TPM and
REBUS-PLS approaches (Esposito Vinzi, Ringle, Squillacciotti, & Trinchera, 2007),
PLS-GAS (Ringle & Schlittgen, 2007) and fuzzy regression techniques (Palumbo,
Romano, & Esposito Vinzi, 2008), as well as the FIMIX-PLS (Hahn, Johnson,
Herrmann, & Huber, 2002; Ringle, 2006; Ringle, Sarstedt, & Mooi, 2009) approach,
which builds on a finite mixture of latent classes.

According to Sarstedt (2008), FIMIX-PLS can currently be regarded as the primary
approach to capture heterogeneity in PLS path modeling. The method has been
integrated in the statistical software application SmartPLS 2.0 (Ringle, Wende, &
Will, 2005) and, hence, is the one most commonly used in PLS path modeling applications
(Sarstedt, Schwaiger, & Ringle, 2009). However, FIMIX-PLS also reveals some
disadvantages (Esposito Vinzi, Ringle, Squillacciotti, & Trinchera, 2007). Firstly,
the approach relies on distributional assumptions to form latent classes, which is
contrary to the methodological character of PLS. Moreover, the methodology only
accounts for the heterogeneity in the inner path model estimates and does not
provide a final clustering of data, but rather determines the probabilities of segment
membership in respect of each observation. Other distance measure-based
PLS typological segmentation techniques have been designed to overcome this and
other limitations. However, while PLS-TPM and REBUS-PLS are limited to reflective
measurement models, PLS-GAS has been specifically designed as a universal
segmentation approach for PLS path modeling.

PLS-GAS is a genetic search/hill-climbing hybrid that first uses a nondeterministic
genetic algorithm to find a good partition by exploring the search space. The
genetic algorithm does not process all possible assignments of $N$ observations to
a given number of $K$ segments and, thus, has to cope with the potential problem
of complexity in data partitioning (Maulik & Bandyopadhyay, 2000). Since there is
no guarantee that the genetic algorithm will provide the global optimum solution,
the best solution subsequently becomes the starting partition for a deterministic
hill-climbing approach to locally improve (if possible) the initial outcome and to
determine the final partition.
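
The deterministic second stage can be illustrated with a generic hill-climbing sketch.
This is not the PLS-GAS implementation: the fit criterion used by PLS-GAS is not
described here, so a within-segment sum of squares on one-dimensional toy data serves
as a stand-in objective, and the starting partition simply plays the role of the best
solution returned by the genetic algorithm. All names and values are hypothetical.

```python
def within_segment_ss(values, assignment, n_segments):
    """Stand-in objective: total within-segment sum of squared deviations."""
    total = 0.0
    for k in range(n_segments):
        members = [v for v, a in zip(values, assignment) if a == k]
        if members:
            mean = sum(members) / len(members)
            total += sum((v - mean) ** 2 for v in members)
    return total


def hill_climb(values, start, n_segments, objective=within_segment_ss):
    """Reassign single observations while this strictly lowers the objective."""
    assignment = list(start)
    best = objective(values, assignment, n_segments)
    improved = True
    while improved:
        improved = False
        for i in range(len(values)):
            current = assignment[i]
            for k in range(n_segments):
                if k == current:
                    continue
                assignment[i] = k
                candidate = objective(values, assignment, n_segments)
                if candidate < best - 1e-12:
                    best, current, improved = candidate, k, True
                else:
                    assignment[i] = current  # undo the trial move
            assignment[i] = current
    return assignment, best


# Toy data with two obvious segments and a deliberately poor starting partition.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
start = [0, 1, 0, 1, 0, 1]
partition, objective_value = hill_climb(values, start, n_segments=2)
print(partition, objective_value)
```

The procedure stops as soon as a full pass over the observations yields no improving
move, so the result is a local optimum relative to single-observation reassignments.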
As PLS-GAS represents the most recent advance in distance-based segmentation
procedures for PLS path modeling, knowledge about the method's performance
is very limited. In addition, the literature does not provide a comparative analysis
of this methodology. Consequently, the goal of this research is to provide an
assessment of PLS-GAS and the most commonly used approach, FIMIX-PLS, by
conducting a computational experiment. A comparison of the outcomes exposes the
strengths and weaknesses of both approaches.



2 Computational Experiment 

Computational experiments on causal models with latent variables require data to be
generated for indicator variables that meet, after model estimation, the pre-specified
parameters for relationships in the inner model, as well as in the outer models. Only
few Monte Carlo simulation studies and computational experiments that require
artificial data for pre-specified parameters have been presented for PLS thus far
(Chin, Marcolin, & Newsted, 2003; Ringle & Schlittgen, 2007; Ringle, Wende, &
Will, 2009; Ringle, Sarstedt, et al., 2009). All of these publications generate data for
latent variable scores in accordance with the pre-specified relationships in the inner
model, thus subsequently obtaining data for the manifest variables with respect to
the measurement models' parameter pre-specifications. Consequently, these studies
confine their PLS analyses to reflective measurement models. However, the
literature does not present suitable approaches that permit artificial data to be generated
for pre-specified SEM parameters when formative measurement models are
involved. Hence, this FIMIX-PLS computational experiment on artificial data also
only draws on reflective measurement models.

In contrast to previous experimental data analyses of FIMIX-PLS in the literature,
this research employs a PLS path model with greater complexity in that it
uses more than one endogenous latent variable and a higher number of a-priori
defined groups. Figure 1 presents the sample path model of the simulation study
with three exogenous and two endogenous latent variables. Each latent variable
has four manifest variables in its reflective measurement model operationalization
with pre-specified outer loadings above 0.9. Measurement invariance does not represent
a critical issue in this computational experiment. Table 1 provides the a-priori
determined inner weights $\gamma_{11}$, $\gamma_{21}$, $\gamma_{22}$ and $\gamma_{32}$ between the three exogenous latent
variables ($\xi_1$, $\xi_2$ and $\xi_3$) and the two endogenous latent variables ($\eta_1$ and $\eta_2$) as well
as the inner weight $\beta_{12}$ between $\eta_1$ and $\eta_2$ in the PLS path model.

Each of the four data groups consists of 200 artificial observations. A computer
program by Ringle and Schlittgen (2007) in the matrix-oriented programming
language GAUSS 9.0 (Aptech, 2008), which employs a fast acceptance-rejection
algorithm, allows us to obtain data for the manifest variables that meet pre-specified
PLS path model parameters and distributional characteristics very precisely. This
approach is consistent with the descriptions, functionalities, and examples in the
PRELIS 2 software package and manual (Jöreskog & Sörbom, 1999), which was
used, for example, by Chin et al. (2003) in an earlier PLS simulation study.
Fig. 1 PLS path modeling results on the aggregate data level



Table 1 Artificial data generation scheme

                 k = 1    k = 2    k = 3    k = 4
                 -----    -----    -----    -----
$\gamma_{11}$     0.9      0.9      0.1      0.1
$\gamma_{21}$     0.1      0.1      0.9      0.9
$\gamma_{22}$    -0.3      0.1     -0.8      0.9
$\gamma_{32}$    -0.8      0.9     -0.3      0.1
$\beta_{12}$      0.1      0.4      0.1      0.4

The basic PLS algorithm is executed with the complete set of artificially generated
data (800 cases) to estimate the path model in Fig. 1. We use the SmartPLS 2.0
(Ringle et al., 2005) software application for the PLS and FIMIX-PLS computations
and follow the suggestions by Chin (1998) for a concise assessment of model
estimations. The methodological implications of PLS (Wold, 1982; Lohmöller, 1989),
especially with respect to its distribution-free character, do not permit the application
of parametric global goodness of fit measures that are utilized in covariance-based
SEMs. The results of partial model structures must satisfy the minimum
requirements of certain non-parametric evaluation criteria (Chin, 1998) such as, for
example, construct reliability $\rho_c$ (>0.6), indicator reliability (>0.5), average variance
extracted (>0.5) and bootstrap significance of formative measurement model
and inner model relationships (Henseler, Ringle, & Sinkovics, 2009).

The evaluation reveals the reliability and validity of the PLS results’ outer and
inner path model structures. For example, all outer loadings are well above the
required minimum value of 0.7 and, thus, the average variance extracted (AVE)
and the composite reliability $\rho_c$ clearly exceed the critical minimum value of
each evaluation criterion. Additional tests provide support for the measures’
discriminant validity. Despite these good results, the inner path model exhibits only
three out of five significant (p < 0.05) relationships ($\gamma_{11}$, $\gamma_{21}$
and $\beta_{12}$) and two paths ($\gamma_{22}$ and $\gamma_{32}$) that are not
significantly different from zero. Moreover, only the $R^2$ value of the latent
endogenous variable $\eta_1$ is at a substantial level. In contrast, the latent
exogenous variables explain almost no variance in the latent endogenous variable
$\eta_2$, which consequently exhibits an $R^2$ value close to zero.

At this stage, an interpretation of PLS results that does not account for unobserved
heterogeneity concentrates only on the significant relationships between the
latent exogenous variables $\xi_1$ and $\xi_2$ and the latent endogenous variable
$\eta_1$, which entail a substantial level of $R^2$. However, the outcomes on the
aggregate data level may be misleading and result in flawed conclusions if unobserved
heterogeneity significantly affects the PLS path model estimates.

In the next step, we apply FIMIX-PLS to the experimental data set. Initially,
FIMIX-PLS results are computed in respect of two segments. Thereafter, the num- 
ber of segments is successively increased and information criteria such as the 
consistent Akaike’s information criterion (CAIC) or the Bayesian information cri- 
terion (BIC) are computed. All heuristics reveal the best fitting outcome (i.e., 
the minimum value) for K = 4 classes, which is in accordance with the pre- 
specified data characteristics. Consequently, FIMIX-PLS probabilities of segment 
membership in respect of four classes are used to classify the individuals into seg- 
ments. Each observation is therefore assigned to one of the four groups according 
to its highest probability of segment membership. The four data sets are sepa- 
rately (group-wise) used as input matrices for the manifest variables to estimate 
the PLS path model for each group of individuals. In practical applications, the 
next step would be to identify an observable explanatory variable which best 
reflects the grouping of data, as indicated by FIMIX-PLS results (Ringle, 2006; 
Ringle, Sarstedt, et al., 2009). The FIMIX-PLS results therefore suggest an a- 
priori segmentation of data into four explainable groups of observations that permit 
multi-group analyses and group-specific interpretations of results. Subsequently, we 
apply the PLS-GAS procedure to the data. As the approach does not give any
indication regarding the number of segments to retain from the data, we base our 
model selection decision on FIMIX-PLS results and run PLS-GAS for K = 4 
segments. 
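The model selection and classification steps described above can be sketched as follows. This is a minimal illustration, not the SmartPLS implementation: the log-likelihood values and parameter counts are hypothetical placeholders (they are not results from this study), and the function names are chosen here for illustration. BIC and CAIC are computed in their standard forms, and each observation is assigned to the segment with its highest membership probability.

```python
import math


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: -2 lnL + N_K * ln(N)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def caic(log_lik, n_params, n_obs):
    """Consistent AIC: -2 lnL + N_K * (ln(N) + 1)."""
    return -2.0 * log_lik + n_params * (math.log(n_obs) + 1.0)


def assign_segments(membership_probs):
    """Hard assignment: index of the largest membership probability per row."""
    return [max(range(len(row)), key=row.__getitem__) for row in membership_probs]


# Hypothetical (log-likelihood, number of free parameters) per number of segments K.
# These numbers are invented for illustration only.
candidates = {2: (-4100.0, 23), 3: (-3900.0, 35), 4: (-3600.0, 47), 5: (-3590.0, 59)}
n_obs = 800

# The best-fitting solution is the one with the minimum criterion value.
best_k_caic = min(candidates, key=lambda k: caic(*candidates[k], n_obs))
best_k_bic = min(candidates, key=lambda k: bic(*candidates[k], n_obs))

# Two hypothetical observations with membership probabilities for four segments.
segments = assign_segments([[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]])
```

With these invented inputs both criteria select K = 4, because the small likelihood gain from a fifth segment does not offset the penalty for the additional parameters.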



3 Results 

The final FIMIX-PLS and PLS-GAS segmentation results for this computational
experiment (Table 2) are comparable with the segment-specific path model
estimation results of the artificially a-priori generated data sets (Table 1). Each of
the four group-specific path model estimations fulfills the reliability and validity
requirements for the reflective measurement of latent variables. The group-wise $R^2$
outcomes increase substantially. For example, the $R^2$ value of the endogenous
variable $\eta_2$ increases from 0.074 on the aggregate data level to a value between
0.664 and 0.948 in the final group-specific path model estimations. Instead of employing
bootstrapping or permutation-based testing routines for PLS multi-group analysis,
this paper draws on Henseler’s (2007) non-parametric procedure for testing
the difference in group-specific PLS path model estimates. This method has been
specifically designed for multi-group PLS analysis and, thus, has certain advantages
compared to alternatively proposed approaches (Henseler et al., 2009). When the
PLS results of each segment are compared in this way, the inner path coefficients
differ significantly (p < 0.01).

Table 2 Segment-specific PLS path modeling results

                               k = 1                               k = 2
                    Orig.     FIMIX-PLS  PLS-GAS      Orig.     FIMIX-PLS  PLS-GAS
$\gamma_{11}$       0.920**   0.909**    0.944**      0.919**   0.931**    0.914**
$\gamma_{21}$       0.076**   0.067**    0.025*       0.080**   0.095**    0.116**
$\gamma_{22}$      -0.303**  -0.338**   -0.320**      0.132**   0.183**    0.103**
$\gamma_{32}$      -0.747**  -0.767**   -0.865**      0.847**   0.889**    0.813**
$\beta_{12}$        0.143**   0.115**    0.029        0.400**   0.438**    0.454**
SSE                 -         0.003      0.03         -         0.006      0.006
$R^2$ ($\eta_1$)    0.86      0.829      0.897        0.861     0.896      0.867
$R^2$ ($\eta_2$)    0.627     0.664      0.825        0.866     0.917      0.856
Size                0.250     0.292      0.188        0.250     0.228      0.299

                               k = 3                               k = 4
                    Orig.     FIMIX-PLS  PLS-GAS      Orig.     FIMIX-PLS  PLS-GAS
$\gamma_{11}$       0.146**   0.126**    0.111**      0.145**   0.142**    0.197**
$\gamma_{21}$       0.916**   0.949**    0.951**      0.916**   0.915**    0.889**
$\gamma_{22}$      -0.797**  -0.878**   -0.981**      0.565**   0.561**    0.601**
$\gamma_{32}$      -0.301**  -0.201**   -0.250**      0.092**   0.090**    0.073**
$\beta_{12}$        0.033     0.066*     0.165**      0.422**   0.431**    0.376**
SSE                 -         0.019      0.056        -         0          0.007
$R^2$ ($\eta_1$)    0.875     0.927      0.923        0.875     0.873      0.834
$R^2$ ($\eta_2$)    0.639     0.707      0.798        0.94      0.948      0.887
Size                0.250     0.227      0.191        0.250     0.253      0.322

* Significant at p < 0.10; ** significant at p < 0.05

As illustrated by this numerical example with simulated data, PLS path modeling 
results on the aggregate data level may be misleading when unobserved heterogene- 
ity affects the inner path model estimates. FIMIX-PLS and PLS-GAS are capable 
of uncovering unobserved heterogeneity in the inner relationships and of dealing 
with it by means of segmentation. The group-specific outcomes change drasti- 
cally in comparison to aggregate data-level results (Fig. 1; Table 2). However, the 
group-specific inner model FIMIX-PLS and PLS-GAS path coefficients fulfill the 
a-priori expectations very well. Table 2 shows that the segment-specific inner model
PLS results of the artificially generated data sets as well as of the FIMIX-PLS
and PLS-GAS segmentation results are very well matched in that they arrive at
comparable levels. The minimum difference between these group-specific results is
|0.001|, while the maximum is |0.184|.

FIMIX-PLS and PLS-GAS perform differently when forming data groups. FIMIX-PLS
uncovers one slightly larger segment with a relatively low average $R^2$ value of
the endogenous latent variables (k = 1) and another segment with a relatively high
average outcome (k = 4). The FIMIX-PLS segment sizes meet the expectations from
the artificial data generation procedure. These observations are in accordance with
the findings of Esposito Vinzi et al. (2007). In contrast, PLS-GAS generates two
larger and two smaller segments. While the FIMIX-PLS $R^2$ outcomes exhibit a
pattern and heterogeneity similar to those of the originally generated sets of data,
PLS-GAS provides relatively high segment-specific $R^2$ results for both endogenous
latent variables in all four segments, and these results are more homogeneous than
those of FIMIX-PLS. PLS-GAS thereby forms groups of data with the highest
group-specific average $R^2$ results for the four segments. Finally, we calculate the
sum of squared errors (SSE) regarding the pre-specified and estimated path
coefficients, as well as the overall root mean squared error (RMSE). Across all
segments, the SSE values of FIMIX-PLS lie at or below the respective PLS-GAS results.
Consequently, with a value of 0.083, the overall RMSE of the FIMIX-PLS results is
almost half of what is achieved by PLS-GAS (RMSE = 0.158).
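The SSE and RMSE figures can be retraced with a short calculation. The sketch below rests on two assumptions that the text does not spell out: that the SSE values in Table 2 compare each method's path coefficients with the "Orig." column of the same segment, and that the overall RMSE is the square root of the mean segment SSE. Both readings approximately reproduce the reported numbers, but the authors' exact formulas may differ.

```python
import math


def sse(reference, estimated):
    """Sum of squared differences between two coefficient vectors."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimated))


def rmse_from_segment_sse(sse_values):
    """Assumed aggregation: square root of the mean segment-level SSE."""
    return math.sqrt(sum(sse_values) / len(sse_values))


# Segment k = 2 from Table 2, ordered gamma_11, gamma_21, gamma_22, gamma_32, beta_12.
orig_k2 = [0.919, 0.080, 0.132, 0.847, 0.400]
fimix_k2 = [0.931, 0.095, 0.183, 0.889, 0.438]
gas_k2 = [0.914, 0.116, 0.103, 0.813, 0.454]

sse_fimix_k2 = sse(orig_k2, fimix_k2)
sse_gas_k2 = sse(orig_k2, gas_k2)

# Rounded SSE values per segment (k = 1..4) as reported in Table 2.
rmse_fimix = rmse_from_segment_sse([0.003, 0.006, 0.019, 0.0])
rmse_gas = rmse_from_segment_sse([0.030, 0.006, 0.056, 0.007])
```

Under these assumptions both methods yield an SSE of about 0.006 for k = 2, and the overall RMSE comes out at roughly 0.084 for FIMIX-PLS and 0.157 for PLS-GAS. The small deviations from the reported 0.083 and 0.158 are consistent with rounding of the tabulated SSE values.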

In conclusion, FIMIX-PLS performs extremely well on the artificially generated
data sets, with a low RMSE and relatively high average $R^2$ values for the latent
endogenous variables (e.g., k = 4). PLS-GAS, on the other hand, demonstrates the
capability to form very homogeneous groups of data with even higher inner model $R^2$
values. Nevertheless, these higher $R^2$ values are obtained at the cost of parameter
recovery accuracy. Maximizing overall predictivity through segmentation
fulfills the goal of PLS path modeling applications and yields outcomes at similar
levels, which is advantageous for interpretation. However, exploiting the beneficial
features of PLS-GAS in this respect requires a considerably greater computational
effort. In this example, PLS-GAS takes about 100 times longer, or more, than the
FIMIX-PLS computations.



4 Summary and Conclusion 

Applications of PLS path modeling usually focus on analyzing the results of 
the overall data set and do not address issues related to heterogeneity. However, 
assuming that the data have been derived from a single population is often unreal- 
istic; furthermore, analyses on an aggregate level can seriously distort calculation 
outcomes. Research interest has only recently been devoted to clustering in PLS 
and several novel approaches have been proposed to response-based segmenta- 
tion. Whereas FIMIX-PLS has been identified as the primary and most often 
used approach to date (Sarstedt, 2008), PLS-GAS has been developed to overcome
the former procedure’s limitations. The key difference between the two
approaches is that PLS-GAS provides a discrete assignment of data to certain
classes, while FIMIX-PLS allows the classification of data via probabilities of 
membership. Furthermore, PLS-GAS overcomes shortcomings of FIMIX-PLS, for 
example by accounting for heterogeneity in the outer models. The initial assess- 
ment and comparison of the two approaches reveal that both detect and deal 
with unobserved heterogeneity very well. Whereas FIMIX-PLS achieves a higher 
model fit in terms of parameter recovery, PLS-GAS achieves higher and more
homogeneous $R^2$ values for all endogenous latent variables across the generated
segments. 

However, this beneficial result of PLS-GAS, which matches the prediction- 
oriented character of the PLS method, comes at the cost of a considerable
computational demand. Thus, the application of PLS-GAS in complex model set-ups
with a greater number of latent and manifest variables might prove problematic from 
an application-oriented point of view. Nevertheless, this weakness is compensated 
by the fact that PLS-GAS allows the integration of formative measurement models. 
Considering that formative measurement models have received increasing research 
interest over the last years (e.g., Gudergan, Ringle, Wende, & Will, 2008), this point 
especially is of great practical interest. However, the literature does not yet offer 
a means with which to accurately generate data with pre-specified parameters for 
such models. 

More research is necessary in this field to evaluate the performance of PLS-GAS 
in models with formative measures. In addition, researchers should analyze more 
complex models that incorporate a greater number of variables, segments and seg- 
ment sizes. This would allow for a more detailed evaluation of the approaches’ 
performance. Finally, studies on the PLS-GAS parameter settings represent critical 
avenues of future research. 

Applying FIMIX-PLS and PLS-GAS to empirical examples with typical het- 
erogeneous data is required to illustrate the applicability and problematic aspects 
of the procedures from a practical point of view. FIMIX-PLS effectively provides 
segmentation outcomes and global evaluation criteria. Comparing the FIMIX-PLS 
evaluation criteria for alternative numbers of segments makes it possible to uncover
unobserved heterogeneity and to determine an appropriate number of classes (Ringle,
2006; Ringle, Sarstedt, et al., 2009). This aspect is an important benefit of FIMIX- 
PLS, which builds on probabilities of membership for each observation. In con- 
trast, the PLS-GAS algorithm is computationally demanding and does not provide 
global evaluation criteria. But PLS-GAS has the advantage of providing clear- 
cut groups of data via PLS typological segmentation. Future studies may aim at 
combining the advantageous features of both methodologies. An initial analysis 
should build on FIMIX-PLS for determining an appropriate number of classes. 
Then, PLS-GAS could use the FIMIX-PLS results as a starting partition for obtaining the
final segmentation outcome. This kind of approach offers very promising capabilities
to effectively address the problem of segmentation in PLS path modeling
applications.



Cluster Ensemble Based on Co-occurrence Data



Dorota Rozmus 



Abstract The ensemble approach based on aggregated models has been successfully
applied in the context of supervised learning in order to increase the accuracy and
stability of classification. Recently, analogous techniques for cluster analysis have
been suggested. Research has shown that, by combining a set of different clusterings,
an improved solution can be obtained.

In the traditional way of learning from a data set, the classifiers are built in a 
feature space. However, an alternative way can be found by constructing decision 
rules on dissimilarity representations. In such a recognition process each object is 
described by a matrix showing the similarities or distances to the rest of training 
samples. 

This research has focused on exploiting the additional information provided by a 
collection of diverse clusterings to generate a co-occurrence (co-association) matrix. 
Taking the co-occurrences of pairs of patterns in the same cluster as votes for their 
association, the data partitions are mapped into a co-association matrix. This n x n 
matrix represents a new similarity measure between patterns. The final data parti- 
tion is obtained by clustering this matrix. 

In the experiments, the behavior of partitions built on co-occurrence data is 
studied. 

Keywords Cluster analysis • Cluster ensemble • Co-association matrix • (Dis)simi- 
larity representation. 



1 Introduction 



Ensemble techniques based on aggregated models have been successfully applied
in supervised learning in order to improve the accuracy and stability of
classification algorithms (Breiman, 1996; Tsymbal, Pechenizkiy, & Cunningham, 2003).
The concept of aggregation can be described as follows: instead of using one model
for prediction, use many different models and then combine their predicted values
with some aggregation operator. In classification the most often used operator
is majority voting: an observation is assigned to the class chosen most often. The
presumption in this approach is that using many models instead of one will give
better results. Among the most popular methods are bagging, based on
bootstrap samples (Breiman, 1996), and boosting, based on giving higher weights to
the wrongly classified examples (Freund, 1990).
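Majority voting as described above can be illustrated with a few lines of code. This is a minimal sketch, not taken from the paper: the function name and the toy predictions are hypothetical, and ties are resolved in favor of the label that appears first among the models.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per observation.

    `predictions` holds one list of predicted labels per model; all lists
    must have the same length (one entry per observation).
    """
    n_obs = len(predictions[0])
    combined = []
    for i in range(n_obs):
        votes = Counter(model[i] for model in predictions)
        # most_common(1) returns the label with the highest vote count.
        combined.append(votes.most_common(1)[0][0])
    return combined


# Three hypothetical classifiers predicting classes for four observations.
model_1 = ["a", "b", "a", "b"]
model_2 = ["a", "a", "a", "b"]
model_3 = ["b", "b", "a", "a"]

ensemble = majority_vote([model_1, model_2, model_3])
```

In this toy case each observation receives the label that at least two of the three models agree on, even though no single model matches the combined result on every observation except the first.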

Recently, the ensemble approach for cluster analysis has been suggested in order 
to increase the classification accuracy and robustness of the clustering solutions. 
The main idea of aggregation is to combine outputs of several clusterings. The 
problem of clustering fusion can be defined generally as follows: given multiple 
partitions of the data set, find a combined clustering with better quality. Recently 
several studies on clustering combination methods have pioneered a new area in the 
conventional taxonomy (Fred, 2002; Fred & Jain, 2002; Jain, Murty, & Flynn, 1999; 
Jain & Dubes, 1988; Strehl & Ghosh, 2002). There are several possible ways to use 
the ensemble approach in the context of unsupervised learning:

• Combine results of different clustering algorithms 

• Produce different partitions by resampling the data, such as in bootstrapping 
techniques, e.g., bagging 

• Use different subsets of features (disjoint or overlapping) 

• Run a given algorithm many times with different parameters or initializations 



2 The Algorithm 

In this research the last of these approaches is taken to a certain extent. The
research draws on three sources. The first is the dissimilarity-based approach
proposed by Pekalska and Duin (2000). In the conventional way of learning from
examples, the classifier is built in a feature space. An alternative is to construct
decision rules on dissimilarity representations. In such a recognition process each
object is described by its distances (or similarities) to the rest of the training
samples. The classifier is built on this dissimilarity representation, i.e., on a
matrix describing the similarities between the objects used for training. The second
source is the idea, proposed by Fred (2002), of combining clustering results by
transforming data partitions into a co-occurrence matrix which shows coherent
associations. This matrix is then used as a distance matrix to extract the final
partition. The third source is Kuncheva, Hadjitodorov, and Todorova (2006), who
obtained very promising results with a dissimilarity representation treated as a
data matrix. Traditionally, the co-occurrence matrix was treated as a distance
matrix for hierarchical algorithms; these authors instead applied it as a data
matrix for clustering methods. Here, a similar split-and-merge approach is used.
The steps of the algorithm are as follows:

Fig. 1 Construction of the co-occurrence matrix and final clusters



Split. For a fixed number C of cluster ensemble members, cluster the data using,
e.g., the k-means algorithm, with different clustering results obtained by random
initializations of the algorithm.

Combine. The underlying assumption is that patterns belonging to a “natural”
cluster are very likely to be co-located in the same cluster among these C different
clusterings. So, taking the co-occurrences of pairs of patterns in the same cluster as
votes for their association, the data partitions produced by C runs of k-means are
mapped into an n x n co-association matrix:

$$\mathrm{co\text{-}assoc}(a, b) = \mathrm{votes}_{ab},$$

where $\mathrm{votes}_{ab}$ is the number of times the pair of patterns (a, b) is
assigned to the same cluster among the C clusterings.

Merge. In order to recover the final clusters, apply any clustering algorithm to this
co-association matrix treated as a similarity representation of the original data.

The idea of the ensemble approach is used here in the phase of preparing the data
to be clustered, not in the clustering itself. A special data description
is prepared by using an aggregated approach, and this matrix is then clustered by a
single run of the clustering algorithm (Fig. 1).
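The Combine step above can be sketched in a few lines. This is an illustrative sketch only: the partitions are hypothetical label vectors standing in for the C k-means runs of the Split step, and the function name is chosen here for illustration. The Merge step, which would cluster the rows of the resulting matrix, is not shown.

```python
def co_association(partitions):
    """Build the n x n co-association matrix from a list of partitions.

    Each partition is a list of cluster labels, one per pattern. Entry
    [a][b] counts in how many partitions patterns a and b share a cluster.
    """
    n = len(partitions[0])
    votes = [[0] * n for _ in range(n)]
    for labels in partitions:
        for a in range(n):
            for b in range(n):
                if labels[a] == labels[b]:
                    votes[a][b] += 1
    return votes


# Three hypothetical clusterings (C = 3) of five patterns. Label values are
# arbitrary: only co-membership matters, so permuted labels cause no problem.
partitions = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]

matrix = co_association(partitions)
```

Patterns 0 and 1 share a cluster in all three partitions, so their entry equals C = 3, whereas patterns 0 and 3 are never grouped together and receive 0 votes. Because only co-membership is counted, the construction sidesteps the problem of matching cluster labels across runs.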



3 Benchmark Experiments 

In the research by Kuncheva, Hadjitodorov, and Todorova (2006) mentioned above,
very promising results were obtained with the k-means, single linkage and mean
linkage algorithms run on a dissimilarity representation. The first method is a
partitioning algorithm based on function optimization, and the last two are examples
of hierarchical algorithms. The aim of this research was to examine other very
popular partitioning algorithms based on function optimization and to compare how
well the proper class structure is recognized by a single application of these
algorithms and by the proposed ensemble approach built on them. In other words, the
aim was to answer the question: is it worth aggregating, e.g., k-means rather than
running the algorithm once? A second aim was to compare the different ensembles with
one another in order to find which algorithm based on function optimization is the
most suitable for the proposed ensemble approach.

Among clustering methods based on function optimization, the following were used:
the k-means algorithm; c-means, developed by Bezdek (1981), which is the fuzzy
version of the k-means algorithm; partitioning around medoids (k-medoids), which
is a more robust version of k-means (Kaufman & Rousseeuw, 1990); and clara,
which, compared to other partitioning methods such as k-medoids, can deal with
much larger data sets (Kaufman & Rousseeuw, 1990). In the aggregated approach, the
co-occurrence matrix was constructed by means of the k-means algorithm, and the
final clustering was extracted with the same algorithms as in the single application
approach, i.e., with k-means, c-means, k-medoids and clara. Ten single
clusterings were generated with the number of clusters equal to the real number
of classes. In further research it would also be worth investigating other algorithms
for constructing the co-occurrence matrix.

Table 1 Characteristics of the data sets

Data set    # Instances    # Features    # Classes
Boston      506            13            4
Ecoli       336            8             8
Glass       214            10            6
Cassini     500            2             3
Cuboids     500            3             4
Smiley      150            2             4

As a measure of the correctness of the clustering, the popular Rand index was used.
All computations were made in R. The algorithms used were kmeans
from library stats, cmeans from library e1071, and pam and clara
from library cluster.
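For readers unfamiliar with the Rand index, the following sketch shows how it can be computed from two label vectors. It is written in Python for illustration and is not the R code used in the study; it assumes the plain (unadjusted) Rand index, i.e., the share of object pairs on which the two partitions agree.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Share of object pairs treated consistently by both partitions.

    A pair counts as an agreement if it is placed in the same cluster in
    both partitions or in different clusters in both partitions.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agreements += 1
    return agreements / len(pairs)


# Identical partitions with permuted labels agree on every pair.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])

# One object placed in the wrong cluster: 3 of the 6 pairs still agree.
partial = rand_index([0, 0, 1, 1], [0, 0, 0, 1])
```

The index ranges from 0 to 1 and does not depend on how the clusters are labeled, which is why it can compare a clustering with known class memberships directly.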

Both real and artificial data sets were used in the research; their characteristics
are summarized in Table 1. The first three are real data sets and the rest are
artificially generated. The real data sets are ones commonly used in classification
for model building and evaluation, i.e., data sets where the objects’ class
membership is known. This information is treated as a priori information
about the number of clusters, an approach often used by researchers in
the field of taxonomy. The real data sets presented are usually used in benchmarking
studies in classification and are made available by the UCI Repository
(Asuncion & Newman, 2007). The artificial data sets are ones
usually used in comparative studies in taxonomy. Their structure is presented in
Figs. 2, 3, and 4. Cassini and Smiley are two-dimensional data sets with clearly
separated classes; Cuboids is a problem where inputs are uniformly distributed
in three-dimensional space within three cuboids and a small cube in the middle
of them.

Fig. 3 The Smiley data set

Fig. 4 The Cuboids data set



4 Results 

The results (Table 2) show that in almost all cases the aggregated approach
yielded better results (higher values of the Rand index) than the
classical single application approach. Comparing the relative improvement of the
ensemble approach over single application, in order to detect where the difference
between aggregated and single application is the largest, the best
results are observed for k-means on the Boston, Glass and Cassini data
sets, c-means on the Cuboids and Smiley data sets, and clara on the Ecoli data set.
Comparing the different ensembles with one another, the best results are again
observed for k-means on the Boston, Glass and Cassini data sets, c-means on the
Cuboids and Smiley data sets, and clara on the Ecoli, Cuboids and Smiley data sets.
In almost all cases the best results were given by the aggregated k-means and
c-means algorithms, whereas aggregated k-medoids and clara were sometimes even worse
than a single application of the algorithm.

Table 2 Values of Rand Index

                      Single application                         Aggregated
Data set    k-Means  c-Means  k-Medoids  Clara     k-Means  c-Means  k-Medoids  Clara
Boston      0.615    0.619    0.615      0.618     0.677    0.655    0.655      0.669
Ecoli       0.781    0.785    0.800      0.772     0.790    0.790    0.788      0.809
Glass       0.680    0.717    0.705      0.717     0.719    0.718    0.718      0.718
Cassini     0.801    0.779    0.988      0.960     0.911    0.790    0.784      0.784
Cuboids     0.923    0.879    0.979      0.969     0.944    0.938    0.987      0.987
Smiley      0.829    0.932    0.936      0.946     0.834    0.981    0.981      0.981



5 Summary 

To sum up, it is worth noting that choosing a good taxonomic method is much
more difficult than choosing a good classifier. In discrimination, the class
membership of the observations is known (supervised learning). In taxonomy, on the
other hand, the class membership of the objects is not known, so the right structure
that the algorithm should find is unknown. To reduce the risk of choosing a wrong
algorithm, the ensemble approach can be used to combine several of them. Since
different clustering methods have different strengths and weaknesses, it is expected
that their joint contribution will have a compensatory effect.

The next advantage of this approach is the possibility of making the results
independent of the selected methods or of some parameters of these methods, e.g.,
initial values of the k parameter for the k-means algorithm. This means that
aggregation makes it possible to stabilize the results of clustering solutions.

A further advantage of the ensemble approach is robustness, which means lower
sensitivity to noise, outliers and sampling variability.

The empirical results suggest that the k-means and c-means algorithms are best
suited to the proposed ensemble approach, whereas k-medoids and clara may
give worse results than the single application approach.



Localized Logistic Regression for Categorical
Influential Factors



Julia Schiffner, Gero Szepannek, Thierry Monthe, and Claus Weihs 



Abstract In localized logistic regression (cp. Loader, Local regression and like- 
lihood, Springer, New York, 1999; Tutz and Binder, Statistics and Computing 
15:155-166, 2005) at each target point where a prediction is required a logistic 
regression model is fitted locally. This is achieved by weighting the training obser- 
vations in the log-likelihood based on their distances to the target observation. For 
interval-scaled influential factors these weights usually depend on Euclidean dis- 
tances. This paper aims to combine localized logistic regression with dissimilarity 
measures more suitable for categorical data. 

Categorical predictors are usually included into regression models by construct- 
ing design variables. Therefore, in principle distance measures can be defined based 
either on the original variables or on the design variables. In the first case match- 
ing coefficients, e.g., the simple or flexible matching coefficients, can be applied. In 
the second case Euclidean distances are suitable, too, since design variables can be 
considered interval-scaled. 

Localized logistic regression with the proposed dissimilarity measures is applied 
to a SNP data set from the GENICA breast cancer study (cp. Justenhoven et al.,
Cancer Epidemiology Biomarkers and Prevention 13:2059-2064, 2004) in order to 
identify combinations of SNP variables that can be used to discriminate between 
cases and controls. By means of localized logistic regression one of the lowest 
error rates in combination with a maximal reduction of the number of predictors 
is achieved. 

Keywords Localized logistic regression • Matching coefficients • SNP data.



1 Introduction 



The logistic regression model is one of the most popular models for describing the 
relationship between a binary response variable and a set of influential factors. It 
is used in many fields, particularly in health sciences, and can cope with predictors
of any scale. Categorical predictors that are nominal or ordinal-scaled can be easily
incorporated by constructing dummy or design variables.

Localized logistic regression (cp. Loader, 1999; Tutz & Binder, 2005) achieves 
more flexibility in estimating the regression function than classic logistic regres- 
sion by fitting a different but simple model separately at each target point. In the 
fitting process only those training observations that are close to the target observa- 
tion are used. This localization is obtained by weighting the training data in the 
log-likelihood according to their distance to the target point. For interval-scaled 
predictors Euclidean distances are usually used, which are not appropriate for
categorical predictors. This paper aims to discuss some ideas concerning the calculation
of weights in case of categorical predictors. The application we have in mind is the 
analysis of SNP data sets that normally exclusively contain categorical variables. 

For this reason firstly in Sect. 2 SNP data and the aims of SNP data analysis are 
explained. Subsequently, logistic regression is briefly described in order to intro- 
duce the notation and to discuss two drawbacks of this method which are likely to 
cause problems in the context of SNP data. Section 4 deals with localized logistic
regression (cp. Loader, 1999; Tutz & Binder, 2005) that resolves these drawbacks. 
The calculation of localization weights in case of categorical influential factors 
and especially SNP data is presented in Sect. 5. Localized logistic regression with 
weights better suited for categorical predictors is applied to a SNP data set from 
the GENICA breast cancer study (cp. GENICA Network, n.d.). The results are 
presented in Sect. 6. Finally, a summary is given. 



2 Analysis of SNP Data 

The human genome consists of 23 pairs of chromosomes where each chromosome 
is a double-helix of two DNA strands (cp. Ickstadt, Müller, & Schwender, 2006;
Schwender, Rabstein, & Ickstadt, 2006). These strands are sequences of nucleotides 
which contain, amongst others, one of the four nitrogen bases adenine (A), thymine 
(T), cytosine (C), and guanine (G). 

A genetic variation found in at least 1% of the population is referred to as a
polymorphism. The most common and best investigated type is the single nucleotide
polymorphism (SNP). This genetic variation occurs when the reference base that
is normally found on a specific nucleotide is replaced by another. Since the human
genome is arranged in pairs of chromosomes three different cases can occur. These
are illustrated in Fig. 1.

Fig. 1 Possible genotypes for three polymorphic sites (cp. Ickstadt et al., 2006)

On the left-hand side of Fig. 1 the reference bases for three polymorphic sites
are shown. On the right-hand side the two corresponding single-stranded DNA
sequences from a chromosome pair (Chromosome 1a and 1b) are given. The
homozygous reference genotype, coded by Ref/Ref or 0, arises if both chromosomes
1a and 1b show the reference base. A base variation on one chromosome, in Fig. 1
on Chromosome 1b, leads to the heterozygous variant type coded by Ref/Var or 1.
In case of a base variation on both chromosomes the genotype is referred to as the
homozygous variant type coded by Var/Var or 2. According to this, SNP variables
are categorical and usually considered nominal-scaled with three possible values
coded by 0, 1, and 2.

SNP data sets can contain several hundred thousand variables in the case of
genome-wide association studies. In candidate SNP studies, where a set of SNPs is
preselected based on biological hypotheses, the number of variables ranges from
several hundred to several thousand.

Even though most SNPs are believed to have no effect, some SNPs may contribute
to the disposition of a disease, for example cancer. An important aim of SNP
data analysis thus is to identify SNPs and SNP combinations associated with an
altered disease risk (cp. Ickstadt et al., 2006). From a statistician's view the
corresponding tasks are solving a classification problem with two classes (controls
and cases) and categorical influential factors (SNPs) and assessing the relevance of
predictors. One possible method is logistic regression (cp. Fahrmeir & Tutz, 2001;
Tutz & Binder, 2005). But tree-based methods such as CART, random forests, and
logic regression (cp. Breiman, Friedman, Olshen, & Stone, 1983; Breiman, 2001;
Ruczinski, Kooperberg, & LeBlanc, 2003) also come into consideration.



3 Logistic Regression 

Let $x \in \mathbb{R}^V$ be a measurement on $V$ variables of any scale and let
$y \in \{0, 1\}$ indicate the class membership. The logistic regression model is

$$\ln \frac{\pi}{1 - \pi} = z'\beta \tag{1}$$

with $\pi = P(y = 1 \mid x)$ denoting the posterior probability of class 1 and
$\beta = (\beta_0, \beta_1, \ldots, \beta_{Q-1})' \in \mathbb{R}^Q$ denoting the
unknown parameter vector (cp. Hosmer & Lemeshow, 2000). The design vector
$z \in \mathbb{R}^Q$ is built from $x$. A common case is linear logistic regression
where for interval-scaled influential factors $z' = (1, x')$.
In order to include categorical predictors properly into the model, dummy or design
variables are constructed using, for example, reference cell coding, which is the most
commonly employed method and is also used throughout this text, or deviation from
means coding (cp. Hosmer & Lemeshow, 2000).
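
Reference cell coding of SNP variables can be illustrated with a minimal sketch. It assumes genotypes coded 0, 1, 2 with 0 as the reference category; the function names `reference_cell_coding`, `design_matrix`, and `posterior` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def reference_cell_coding(values, categories, reference):
    """Dummy columns for one categorical variable; the reference category is omitted."""
    values = np.asarray(values)
    kept = [c for c in categories if c != reference]
    return np.column_stack([(values == c).astype(float) for c in kept])

def design_matrix(X, categories=(0, 1, 2), reference=0):
    """Build design vectors z' = (1, dummies of variable 1, dummies of variable 2, ...)."""
    X = np.asarray(X)
    blocks = [np.ones((X.shape[0], 1))]  # intercept column
    for j in range(X.shape[1]):
        blocks.append(reference_cell_coding(X[:, j], categories, reference))
    return np.hstack(blocks)

def posterior(Z, beta):
    """Posterior probability of class 1 under the logit model ln(pi / (1 - pi)) = z'beta."""
    return 1.0 / (1.0 + np.exp(-(Z @ beta)))

# Three observations of two SNP variables (assumed toy data).
X = np.array([[0, 2], [1, 0], [2, 1]])
Z = design_matrix(X)
probs = posterior(Z, np.zeros(Z.shape[1]))  # all 0.5 for beta = 0
```

With two SNP variables the design vector has $Q = 5$ entries: one intercept and two dummy variables per SNP.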

Logistic regression can be considered a linear classification method because it
aims to separate the classes by a hyperplane $\{z \mid z'\beta = 0\}$ in the design
space. Similar to LDA it performs well on a wide range of classification tasks. But
heterogeneity within classes may cause problems, particularly if classes are split
into different parts that lie in distinct regions of the predictor space such that a
linear separation by one hyperplane is not possible. In this situation CART and
random forests may be beneficial because, in contrast to logistic regression,
multiple separating hyperplanes are fitted. Alternatively, localized logistic
regression can be used.

Another drawback of logistic regression arises in the context of model building if
the true model is likely to contain some, possibly higher-order, interactions in
addition to the main effects. Even if the number of predictors is not too large, the
number of higher-order interactions that come into consideration can be huge. Since
interactions have to be explicitly included as input variables in the logistic
regression model, the relevant interactions have to be identified, which is often
hard and time-consuming. For convenience, methods that build interactions internally
are preferred. This condition is met by the tree-based methods mentioned at the end
of Sect. 2. An alternative way out of this problem is again provided by localized
logistic regression (cp. Tutz & Binder, 2005).

Both problems mentioned in this section may occur in SNP data analysis. Since 
complex diseases like cancer are likely to be caused by combinations of SNPs, often 
together with environmental factors like smoking or body mass index, the true model 
can be expected to contain higher interactions. As SNP data sets can contain an 
enormous number of variables the search for relevant interactions is hardly feasible. 
Moreover, it is likely that there exist different mechanisms for the development of 
a complex disease. That is, for distinct subjects different SNPs and SNP combina- 
tions may contribute to the disposition of the disease and thus heterogeneous classes 
may occur. 



4 Localized Logistic Regression 

In localized logistic regression a logistic regression model is fitted locally at
every target point $x$. This is done using only observations close to $x$. Let the
training data be $(x_n, y_n)$, $n = 1, \ldots, N$. The localization is achieved by
weighting the training observations $x_1, \ldots, x_N$ in the log-likelihood
(cp. Loader, 1999; Tutz & Binder, 2005). For interval-scaled variables the weight
corresponding to training observation $x_n$ is calculated as

$$w_k(x, x_n) = K\left(\frac{\lVert x - x_n \rVert}{h_k(x)}\right) \tag{2}$$

(cp. Loader, 1999; Tutz & Binder, 2005) and thus depends on the Euclidean distance
between $x_n$ and the target point $x$. The bandwidth $h_k(x)$ can be chosen to be
constant ($h_k(x) = k$, $k \in \mathbb{R}$). Another possibility is to define the
bandwidth as $h_k(x) = \lVert x - x_{(k)} \rVert$, $k \in \{1, \ldots, N\}$, the
distance to the $k$th nearest neighbor of $x$ in the training data. The latter is
recommended in Loader (1999) and Tutz and Binder (2005) because it is adaptive to
the local density of data points. The function $K$ is a kernel window. Possible
choices for $K$ are, for example, the Gaussian kernel

$$K_G(x) = \exp(-x^2) \tag{3}$$

and the tricube kernel

$$K_T(x) = \begin{cases} (1 - |x|^3)^3 & \text{if } |x| < 1, \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

Using the Gaussian kernel all training observations obtain a positive weight. In 
contrast, with the tricube kernel some weights can equal zero. 
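
The weighting scheme for interval-scaled variables can be sketched as follows. This is only an illustration: the tricube kernel is assumed to have its usual form $(1 - |u|^3)^3$ on $|u| < 1$, the function names are hypothetical, and the nearest-neighbor bandwidth is assumed to be positive (no ties at distance zero beyond the $k$th neighbor).

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel exp(-u^2); strictly positive for every u."""
    return np.exp(-np.asarray(u, dtype=float) ** 2)

def tricube_kernel(u):
    """Tricube kernel (1 - |u|^3)^3 for |u| < 1 and 0 otherwise (assumed standard form)."""
    a = np.abs(np.asarray(u, dtype=float))
    return np.where(a < 1.0, (1.0 - a ** 3) ** 3, 0.0)

def localization_weights(x, X_train, k, kernel=tricube_kernel, nearest_neighbor=True):
    """Weights K(||x - x_n|| / h_k(x)) for all training observations x_n."""
    dist = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x, dtype=float), axis=1)
    if nearest_neighbor:
        h = np.sort(dist)[int(k) - 1]  # distance to the k-th nearest neighbor of x
    else:
        h = float(k)                   # constant bandwidth h_k(x) = k
    return kernel(dist / h)

# Toy data: distances of the training points to x are 0, 1, 2, and 5.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
x = np.array([0.0, 0.0])
w_nn = localization_weights(x, X_train, k=3)
w_const = localization_weights(x, X_train, k=1.0, kernel=gaussian_kernel,
                               nearest_neighbor=False)
```

In this toy example the tricube weights of the two most distant observations are exactly zero, whereas all Gaussian weights remain positive.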

In (2) the weights depend on $x$ and $x_n$, $n = 1, \ldots, N$, that is, on the
values of the original variables. Alternatively, the weights can be based on the
design vectors $z$ and $z_n$, which is for example proposed in Tutz and Binder (2005).
For interval-scaled variables it was found in Tutz and Binder (2005) that the
performance of localized logistic regression seems not to depend strongly on the type
of weighting and thus the variant that is easier to handle can be used. The
calculation of weights for categorical influential factors, and especially for SNP
variables, is described in Sect. 5.

Once the weights are obtained, for each target observation $x$ an individually
weighted log-likelihood

$$l_x(\beta) = \sum_{n=1}^{N} \bigl(y_n \ln \pi_n + (1 - y_n) \ln(1 - \pi_n)\bigr)\, w_k(x, x_n) \tag{5}$$

is calculated, leading to an individual estimate $\hat{\beta}_x$ of the parameter
vector. The weights grow with increasing affinity of $x_n$ and the target observation
$x$, thus permitting training observations similar to $x$ a greater influence on the
parameter estimation. In this way, the contribution of one predictor to the response
variable [cp. (1)] is allowed to vary depending on the values of the remaining
predictors, and the interaction structure is thus implicitly modeled internally.
Therefore, it is sufficient to employ models without interaction terms. Moreover,
since the logistic regression model is fitted separately at every target point,
localized logistic regression can deal with situations where a separation of the
classes by one single hyperplane is not possible.

For each $x$ the individual parameter estimate $\hat{\beta}_x$ is computed by
iterative Fisher scoring. In case of numerical problems due to local collinearities
of predictors, it is proposed in Tutz and Binder (2005) to add the penalty term
$-\lambda \beta'\beta$ to the weighted log-likelihood in (5). Setting $\lambda$ to
zero again results in the unpenalized log-likelihood.

The main problem with localization is the curse of dimensionality. In high
dimensions local estimates are hardly local. For this reason localization techniques
work properly only in low dimensions and should be combined with dimensionality
reduction (cp. Tutz & Binder, 2005). For this purpose a one-step procedure for
variable selection is proposed in which the relevance of predictors is assessed by a
local variant of Wald tests. After estimating the parameter vector, the test
statistic (6) is calculated and the predictors for which a threshold $c_p$ is
exceeded are selected. The variance of the parameter estimates is estimated by means
of the inverse Fisher matrix that was computed during the Fisher scoring. A simple
measure of overall predictor importance can be obtained as the relative frequency of
target observations for which a predictor was selected.

Altogether, the following procedure is repeated for every target observation $x$.
Step 3 is optional:

1. Calculate weights $w_k(x, x_n)$ (or $w_k(z, z_n)$).

2. Determine $\hat{\beta}_x$ by iterative Fisher scoring (with penalty $\lambda$).

3. If the Fisher scoring converges, select predictors, recalculate the weights, and
repeat the Fisher scoring for the selected influential factors.

4. Use the (reduced) model to predict the class for $x$.

This procedure depends on three parameters $k$, $\lambda$, and $c_p$ that can be
optimized by means of a grid search.
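
Steps 2 and 4 of the procedure can be sketched as follows, assuming the weights from step 1 are already available. The sketch maximizes the weighted log-likelihood with penalty $-\lambda \beta'\beta$ by Fisher scoring; the optional selection step 3 is omitted, the function names are hypothetical, and the penalty is (as an assumption) applied to the intercept as well.

```python
import numpy as np

def fit_weighted_logistic(Z, y, w, lam=0.0, max_iter=50, tol=1e-8):
    """Fisher scoring for sum_n w_n (y_n ln pi_n + (1 - y_n) ln(1 - pi_n)) - lam * beta'beta.

    Returns the estimate, the (penalized) Fisher matrix, and a convergence flag.
    """
    beta = np.zeros(Z.shape[1])
    fisher = np.eye(Z.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        score = Z.T @ (w * (y - pi)) - 2.0 * lam * beta
        fisher = (Z * (w * pi * (1.0 - pi))[:, None]).T @ Z + 2.0 * lam * np.eye(Z.shape[1])
        step = np.linalg.solve(fisher, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            return beta, fisher, True
    return beta, fisher, False

def predict_local(z, Z_train, y_train, weights, lam=0.0):
    """Fit the locally weighted model for one target design vector z and predict its class."""
    beta, _, converged = fit_weighted_logistic(Z_train, y_train, weights, lam)
    prob = 1.0 / (1.0 + np.exp(-(z @ beta)))
    return int(prob >= 0.5), prob, converged

# Toy data: intercept plus one dummy variable; class-1 frequencies are 1/3 and 2/3.
Z_train = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]], dtype=float)
y_train = np.array([0, 0, 1, 0, 1, 1], dtype=float)
weights = np.ones(len(y_train))  # equal weights reduce to classic logistic regression
beta_hat, fisher, ok = fit_weighted_logistic(Z_train, y_train, weights)
label, prob, _ = predict_local(np.array([1.0, 1.0]), Z_train, y_train, weights)
beta_pen, _, ok_pen = fit_weighted_logistic(Z_train, y_train, weights, lam=1.0)
```

The inverse of the returned Fisher matrix is the quantity from which the variances of the parameter estimates would be taken for the local Wald tests.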



5 Calculation of Weights for Categorical Predictors 

Since Euclidean distances are not appropriate for categorical data, alternative
measures are needed. These can be chosen to depend either on the values $x$ and
$x_1, \ldots, x_N$ of the original variables or, as proposed in Tutz and Binder
(2005), on the values of the design variables.

In the first case, in order to obtain distance measures for categorical data it is
common practice to apply similarity coefficients that can be easily transformed into
dissimilarities (cp. Cox & Cox, 2001). Let $S(x, x_n)$ denote the similarity of the
observations $x$ and $x_n$. In the simplest case where $S(x, x_n) \in [0, 1]$ a
distance measure can be obtained as $1 - S(x, x_n)$. According to (2), the
localization weights then can be calculated as

$$w_k(x, x_n) = K\left(\frac{1 - S(x, x_n)}{h_k(x)}\right). \tag{7}$$



Similarity measures for categorical data are provided by the large group of matching
coefficients (cp. Anderberg, 1973; Cox & Cox, 2001). These are generally used to
measure association among variables but can be extended to assess the similarity
of data units described by categorical variables. Matching coefficients can be well
explained on the basis of contingency tables. The most common measure is the
simple matching coefficient (SMC), which is given as the percentage of matches,
that is, the sum of the diagonal entries in the contingency table divided by the sum
of all entries $N$. Table 1 shows a contingency table of the target observation $x$
and a training observation $x_n$. Since our application is SNP data, both observation
vectors are assumed to contain only the values 0, 1, and 2. For this special
contingency table the SMC is given as

$$S_{SM}(x, x_n) = \frac{a + e + i}{N}. \tag{8}$$



For SNP variables the homozygous reference genotype (coded by 0) is the most
frequent category, whereas variants, the heterozygous variant type (1) and especially
the homozygous variant type (2), are rare. For this reason matches containing
variants may be indicative of a strong association while 0-0-matches are not. On the
basis of these considerations a flexible matching coefficient (FMC) is proposed in
Ickstadt et al. (2006), which is calculated as

$$S_{FM}(x, x_n) = \frac{0.25a + 2e + 4i + 0.5(f + h)}{0.25a + 2e + 4i + 0.5(f + h) + b + c + d + g}. \tag{9}$$



This FMC down-weights the 0-0-matches and stresses matches containing variants. 
Moreover, pairs of nonmatching values in which at least one variant is present for 
both observations contribute with a weight of 0.5 to the matches. The FMC has 
turned out to be useful for the clustering of SNP variables. 
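
Both matching coefficients can be computed directly from the 3x3 contingency table of two genotype vectors. The following minimal sketch assumes genotypes coded as integers 0, 1, 2 and vectors of equal, nonzero length; the function names are hypothetical.

```python
import numpy as np

def contingency_table(x, x_n):
    """3x3 table of genotype pairs; rows index the value in x, columns the value in x_n."""
    table = np.zeros((3, 3), dtype=int)
    for u, v in zip(x, x_n):
        table[u, v] += 1
    return table

def simple_matching(x, x_n):
    """SMC: share of matching positions, (a + e + i) / N."""
    t = contingency_table(x, x_n)
    return np.trace(t) / t.sum()

def flexible_matching(x, x_n):
    """FMC: down-weights 0-0 matches, stresses variant matches, half-counts 1-2 and 2-1 pairs."""
    (a, b, c), (d, e, f), (g, h, i) = contingency_table(x, x_n)
    matches = 0.25 * a + 2 * e + 4 * i + 0.5 * (f + h)
    mismatches = b + c + d + g
    return matches / (matches + mismatches)

# Toy genotype vectors: a = 2, b = 1, e = 1, f = 1, i = 1, all other cells 0.
x = [0, 0, 1, 2, 1, 0]
x_n = [0, 1, 1, 2, 2, 0]
smc = simple_matching(x, x_n)    # (2 + 1 + 1) / 6
fmc = flexible_matching(x, x_n)  # 7 / (7 + 1)
```

In this example the FMC is larger than the SMC because the two observations agree on both variant positions.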

Since a large number of SNP data sets contain epidemiological variables in addition
to SNP variables, the observations $x$ and $x_n$ can be represented as
$x' = (x_{\mathrm{SNP}}', x_{\mathrm{epi}}')$ and
$x_n' = (x_{n,\mathrm{SNP}}', x_{n,\mathrm{epi}}')$. Usually the epidemiological
variables either are categorical or are transformed into categorical variables
(e.g., age groups). Then the similarity of $x$ and $x_n$ can be calculated as the
weighted sum

$$S(x, x_n) = w_{\mathrm{SNP}}\, S_{FM}(x_{\mathrm{SNP}}, x_{n,\mathrm{SNP}}) + w_{\mathrm{epi}}\, S_{SM}(x_{\mathrm{epi}}, x_{n,\mathrm{epi}}) \tag{10}$$

of the FMC for values of SNP variables and the SMC for epidemiological variables.
SMC and FMC can take values in [0, 1]. In order to obtain an overall similarity
$S(x, x_n) \in [0, 1]$, the weights $w_{\mathrm{SNP}}$ and $w_{\mathrm{epi}}$ should
fulfill $w_{\mathrm{SNP}}, w_{\mathrm{epi}} \geq 0$ and
$w_{\mathrm{SNP}} + w_{\mathrm{epi}} = 1$. For convenience, we choose these weights to
depend on the proportions of SNP and epidemiological variables. If a local variable
selection is done, the weights are adapted to the current proportions. The
localization weights $w_k(x, x_n)$ then can be calculated using (7).

Table 1 Contingency table of two SNP observations x and x_n. All entries (a to i)
sum up to N

                 x_n = 0    x_n = 1    x_n = 2
    x = 0           a          b          c
    x = 1           d          e          f
    x = 2           g          h          i
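
The weighted combination and the resulting localization weights can be sketched as follows. The sketch takes the FMC and SMC values as given inputs and assumes, as one possible reading of the text, that the weights equal the proportions of SNP and epidemiological variables; all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel exp(-u^2)."""
    return np.exp(-np.asarray(u, dtype=float) ** 2)

def combined_similarity(s_fm_snp, s_sm_epi, n_snp, n_epi):
    """Convex combination of FMC (SNP part) and SMC (epidemiological part).

    Assumption: the weights are the proportions of the two variable types.
    """
    w_snp = n_snp / (n_snp + n_epi)
    w_epi = 1.0 - w_snp
    return w_snp * s_fm_snp + w_epi * s_sm_epi

def weights_from_similarity(similarities, k, kernel, nearest_neighbor=False):
    """Localization weights K((1 - S(x, x_n)) / h_k(x)) for all training observations."""
    d = 1.0 - np.asarray(similarities, dtype=float)
    # Constant bandwidth k, or the dissimilarity to the k-th nearest neighbor.
    h = np.sort(d)[int(k) - 1] if nearest_neighbor else float(k)
    return kernel(d / h)

# Example with 68 SNP and 6 epidemiological variables (assumed similarity values).
s = combined_similarity(0.875, 0.5, n_snp=68, n_epi=6)
w_const = weights_from_similarity([1.0, 0.5, 0.0], k=2.0, kernel=gaussian_kernel)
w_nn = weights_from_similarity([1.0, 0.5, 0.0], k=2, kernel=gaussian_kernel,
                               nearest_neighbor=True)
```

Because the dissimilarity $1 - S(x, x_n)$ never exceeds 1, a constant bandwidth above 1 assigns a substantial weight to every training observation.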

In the second case, we follow the proposition in Tutz and Binder (2005) and
choose the localization weights to depend on the values of the design variables.
The design variables can be considered interval-scaled and, therefore, the Euclidean
distance can be used, as in the original form of localized logistic regression. This
is convenient, but in contrast to the dissimilarity measures already described in this
section the Euclidean distance depends on the choice of the coding method and of
the reference category.



6 Application to SNP Data 

GENICA stands for the Interdisciplinary Study Group on Gene Environment Inter- 
action and Breast Cancer in Germany (cp. GENICA Network, n.d.). The GENICA 
study is an age-matched, population-based, case-control candidate SNP study that 
aims to identify genetic and gene-environment associated breast cancer risks. The 
data set we will analyze in this section contains 1,166 observations, 605 controls 
and 561 cases, of 68 SNP variables and six categorical epidemiological vari- 
ables, namely age, intake of oral contraceptives, breast cancer in family, hormone 
replacement therapy, body mass index, and smoking status. 

In order to apply localized logistic regression we construct dummy variables
using reference cell coding and employ a simple linear model that includes an
intercept term but no interactions. We try all combinations of kernel functions,
the Gaussian and the tricube kernel [(3) and (4)], and dissimilarity measures, the
SMC (8), the weighted combination of SMC and FMC (10), and the Euclidean
distance. For a start we use the constant bandwidth $h_k(x) = k$. We calculate the
tenfold cross-validated error rates of localized logistic regression for three cases:
no variable selection, global, and local variable selection. Global variable selection
is done before cross-validation by means of the Pearson chi-square test (cp. Hosmer
& Lemeshow, 2000). All predictors with a p-value lower than 0.2 are selected.
The local predictor selection is performed based on the local variant of the Wald
test statistic in (6). In order to obtain optimal values for the parameters $k$,
$\lambda$, and $c_p$ we employ a grid search where each combination is evaluated with
tenfold cross-validation.

The lowest error rates found are given in Table 2. The choice of the kernel and the
dissimilarity measure does not affect the error rate except for local variable
selection, where the best results are obtained using the SMC. The error rates
resulting from the weighted combination of SMC and FMC or the Euclidean distance are
slightly larger (38.4–40.5%). The FMC that was developed specifically for SNP data is
not beneficial on this data set.

Table 2 Best tenfold cross-validated error rates of localized logistic regression

    Variable selection    Kernel      10 cv error (%)
    No                    Tricube     42.5
                          Gaussian    42.4
    Global                Tricube     36.7
                          Gaussian    36.7
    Local                 Tricube     36.9
                          Gaussian    37.2

Table 3 Best tenfold cross-validated error rates of common methods

    Method                 10 cv error (%)
    CART                   37.9
    Random forests         38.2
    Logic regression       38.5
    Logistic regression    36.6

The optimal bandwidths k found in case of global or no variable selection are 
rather large. For the two dissimilarity measures based on matching coefficients they 
range from 1.1 to 2.1, whereas the dissimilarity with the farthest neighbor in the
data set cannot exceed 1. Applying the nearest-neighbor bandwidth is only sensible
in case of local variable selection, where the optimal constant bandwidths range
from 0.5 to 0.7, but does not result in any improvements in terms of the error rate.
The same applies for the Euclidean distance. Moreover, the large bandwidths found 
in case of no and global variable selection indicate that the local parameter esti- 
mates are hardly local, probably due to a too large number of dimensions (cp. Tutz 
& Binder, 2005). Therefore, for these cases we expect the error rates of localized 
logistic regression to be similar to the error rates of classic logistic regression. 

The performance of localized logistic regression is compared with the perfor- 
mance of CART, random forests, and logic regression which are common methods 
in SNP data analysis. Moreover, we apply classic logistic regression. For this pur- 
pose we use reference cell coding and, for convenience, employ a simple linear 
model without interaction terms. For all methods parameter tuning is performed 
and we try several variable selection methods, e.g., the Pearson chi-square test or a 
selection based on variable importance measures provided by random forests. Vari- 
able selection for all classification methods improves the error rate. The best results 
are given in Table 3. The best error rates obtained by localized logistic regression 
are slightly lower than the error rates of CART, random forests, and logic regression. 
But localized logistic regression does not improve the error rate of classic logistic 
regression. 

However, the benefit of localized logistic regression becomes apparent if vari- 
able importance is considered. The best error rate of classic logistic regression was 
achieved based on 23 variables selected by means of the Pearson chi-square test. 
In Table 4 the overall variable importance obtained from localized logistic regres- 
sion with local variable selection (corresponding to the results in Table 2) is given. 

Table 4 Overall variable importance. ERCC2_6540, ERCC2_18880, and Gen_45_l are SNP
variables. The epidemiological variable hrt10year indicates the administration of a
hormone replacement therapy within the last 10 years

    Kernel      ERCC2_6540 (%)    ERCC2_18880 (%)    Gen_45_l (%)    hrt10year (%)
    Tricube     100.00            86.96              0.26            0.09
    Gaussian    100.00            95.11              -               -



By means of localized logistic regression about the same error rates as for classic
logistic regression are achieved using mainly two instead of 23 variables. For
comparison, we calculate the error rate of classic logistic regression on these two
variables, which is 41.1%.

Moreover, the low number of selected variables considerably improves the
interpretability of the coefficients of the locally fitted logistic regression
models. A woman has a higher risk of developing breast cancer if the SNP ERCC2_6540
is of the homozygous reference genotype and ERCC2_18880 is not of this type. The gene
ERCC2 is related to DNA repair proficiencies (cp. Justenhoven, Hamann, Pesch,
Harth, Rabstein, et al., 2004).



7 Summary 

In this paper localized logistic regression was modified for discrete influential
factors and especially SNP variables. Concerning the calculation of weights, three
dissimilarity measures, namely the SMC, the convex combination of SMC and
FMC, and the Euclidean distance in the design space, were proposed. Localized
logistic regression with these measures was applied to a SNP data set from the
GENICA breast cancer study. As expected, it turned out that localized logistic
regression only works properly in combination with variable selection. It shows
slightly better results than CART, random forests, and logic regression. The choice
of dissimilarity measure in most cases does not affect the error rate, except for
local variable selection where the SMC yields the best results. The FMC, which was
originally developed for clustering SNP variables, leads to slightly higher error
rates. Unfortunately, localized logistic regression does not improve the error rate
of classic logistic regression on this data set. But using localized logistic
regression, about the same error rates as for classic logistic regression are
obtained based on mainly two instead of 23 variables. This considerably enhances the
interpretability of the results.



Clustering Association Rules with Fuzzy Concepts



Matthias Steinbrecher and Rudolf Kruse 



Abstract Association rules constitute a widely accepted technique to identify
frequent patterns inside huge volumes of data. Practitioners prefer the
straightforward interpretability of rules; however, depending on the nature of the
underlying data, the number of induced rules can be intractably large. Even
reasonably sized result sets may contain a large number of rules that are
uninteresting to the user because they are too general, are already known, or do not
match other user-related intuitive criteria. We allow the user to model his
conception of interestingness by means of linguistic expressions on rule evaluation
measures and compound propositions of higher order (i.e., temporal changes of rule
properties). Multiple such linguistic concepts can be considered a set of fuzzy
patterns (Fuzzy Sets and Systems 28(3):313-331, 1988) and allow for the partition of
the initial rule set into fuzzy fragments that contain rules of similar membership
to a user's concept (Höppner et al., Fuzzy Clustering, Wiley, Chichester, 1999;
Computational Statistics and Data Analysis 51(1):192-214, 2006; Advances in Fuzzy
Clustering and Its Applications, chap. 1, pp. 3-30, Wiley, New York, 2007). With
appropriate visualization methods that extend previous rule set visualizations
(Foundations of Fuzzy Logic and Soft Computing, Lecture Notes in Computer Science,
vol. 4529, pp. 295-303, Springer, Berlin, 2007) we allow the user to instantly assess
the matching of his concepts against the rule set.

Keywords Association rules · Exploratory data analysis · Fuzzy concepts ·
Temporal changes



1 Introduction 



Frequent pattern mining (i.e., the identification of items whose number of
co-occurrences exceeds some predetermined threshold, the so-called minimum support)
constitutes one of the most well-known and widely used techniques for mining
large volumes of data (Agrawal, Imielinski, & Swami, 1993; Agrawal, Mannila,
Srikant, Toivonen, & Verkamo, 1996). Since it has gained wide acceptance among
business organizations, it is more commonly known as market basket analysis. This
attractiveness can be attributed to several aspects. On the one hand, it is easily
understood and applied by practitioners who do not necessarily need thorough skills
in data mining or statistics, i.e., the results immediately convey an insight. In
addition, the small amount of parameterization (i.e., specifying only the minimum
support) and its intuitive meaning add to the comfort. On the other hand, the
development of efficient algorithms makes it possible to induce frequent patterns
from virtually any practical data source (Borgelt, 2003, 2005). In practice the set
of induced frequent patterns is often post-processed to induce association rules,
which users find more intuitive even though the frequent pattern set is smaller and
contains the same information. For the induction of association rules, every
frequent pattern is checked as to whether it can be turned into a rule by splitting
off one item that serves as the conclusion (or consequent), whereas the remaining
items act as the condition (or antecedent). To qualify as a rule, the conditional
probability of seeing the consequent given the antecedent (the so-called confidence)
has to exceed a given threshold, thus introducing another parameter. For the
remainder we will focus our discussion on association rules.

Despite the given advantages, the usefulness of the induced association rules
relies heavily on the number of rules that are found. Since they are generated by a
search through subsets of items, the number of resulting rules may grow intractably
large, especially when the minimum support is set to a rather low value. For
example, a lending company may want to assess its current contracts with respect to
patterns that may suggest a possible failure of repayments. Since the nonpayments
refer to (hopefully) only a tiny portion of its overall contracts, the notion
"frequent" has to be relaxed considerably to enable these subsets and rules to be
found. This in turn will also return an excessive number of rules that relate to
correctly repaid loans and that swamp the really interesting rules.

Another aspect that needs to be addressed is the temporal evolution of rules.
When rules are interpreted as indicators of problems, they are unlikely to arise all
of a sudden but will develop and evolve slowly over time. The same argument can
be applied to the opposite case: when a recognized problem has been fixed, the
effectiveness of the measures taken will not be visible instantly but, again, will
emerge as time passes.

In this paper we study an approach to postprocess a set of association rules with 
respect to temporal changes (of rule evaluation measures that are summarized in the 
next section) thus grouping these rules into clusters of similar temporal behavior. 
Since we intend to enable the user to identify interesting pattern evolutions in an 
intuitive manner, the description of an interesting pattern behavior will be allowed 
to be specified in a linguistic way. 

The remainder of the paper is organized as follows: Sect. 2 introduces the
nomenclature for association rules and their respective evaluation measures.
Section 3 motivates and explains the method of describing the temporal behavior of
association rules and their subdivision into different groups or clusters of similar
behavioral properties. Section 4 presents two examples: a proof of concept that
demonstrates the algorithm with an artificial data set, and an application to
empirical data that motivates the real-world applicability. Section 5 concludes and
points out further directions of investigation.



2 Nomenclature 



The classic approach (Agrawal et al., 1993) of association rule induction consists
of first finding subsets of items (so-called item sets) that occur together in more
than a predefined fraction (the minimum support) of transactions and then trying
to identify a single item within each item set such that the probability of observing
this item given the remaining items of the item set exceeds some other predefined
threshold (the minimum confidence). By transaction, we mean a tuple (like a row of
a database table) with exclusively nominal attributes $A_1, \ldots, A_n$. A rule then
has the following form (maybe not using all attributes):

$$A_1 = a_1 \wedge \cdots \wedge A_n = a_n \to C = c \;\overset{\text{abbr}}{=}\; a \to c.$$



To quantify a rule, a multitude of rule evaluation measures has been devised (see,
e.g., Yao & Zhong, 1999). The most well-known ones, which are used here, are:

• Relative support: $\mathrm{supp}(a \to c) = P(c, a)$

• Confidence: $\mathrm{conf}(a \to c) = P(c \mid a)$

• Recall: $\mathrm{recall}(a \to c) = P(a \mid c)$

• Lift: $\mathrm{lift}(a \to c) = \dfrac{P(c \mid a)}{P(c)} = \dfrac{P(c, a)}{P(c)\,P(a)}$


Our method does not rely on the source of the association rules, i.e., the user is free 
to choose any algorithm that suits her needs. However, the rules must have common 
antecedent attributes in order to be displayed in one and the same chart. 
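
The four measures can be estimated from a transaction table by simple counting. The following sketch assumes transactions stored as dictionaries of nominal attribute values; the function name `rule_measures` and the loan attributes in the example are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Relative support, confidence, recall, and lift of the rule antecedent -> consequent.

    transactions: list of dicts mapping attribute names to nominal values.
    antecedent, consequent: dicts of attribute/value conditions.
    """
    def holds(t, cond):
        return all(t.get(attr) == val for attr, val in cond.items())

    n = len(transactions)
    n_a = sum(holds(t, antecedent) for t in transactions)
    n_c = sum(holds(t, consequent) for t in transactions)
    n_ac = sum(holds(t, antecedent) and holds(t, consequent) for t in transactions)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    p_a, p_c, p_ac = n_a / n, n_c / n, n_ac / n
    return {
        "supp": p_ac,               # P(c, a)
        "conf": p_ac / p_a,         # P(c | a)
        "recall": p_ac / p_c,       # P(a | c)
        "lift": p_ac / (p_a * p_c), # P(c, a) / (P(c) P(a))
    }

# Toy loan data (assumed): three low-income contracts, two of which were not repaid.
transactions = [
    {"income": "low", "repaid": "no"},
    {"income": "low", "repaid": "no"},
    {"income": "low", "repaid": "yes"},
    {"income": "high", "repaid": "no"},
]
m = rule_measures(transactions, {"income": "low"}, {"repaid": "no"})
```

A lift below 1, as in this toy example, indicates that the consequent is slightly less likely under the antecedent than in the data overall.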



3 Linguistic Clustering 

Before we introduce our approach to cluster sets of association rules, we briefly
review the visualization that is used to display the resulting rule sets.



3.1 Rule Trajectory Visualization 

Each rule is represented by a circle, the size of which corresponds to the absolute
number of database entries that are covered by the rule.

When given a set of rules, we have to locate each visual representation in a chart.
We proposed to assign as coordinates the values of association rule evaluation
measures (Steinbrecher & Kruse, 2007). More specifically, we use the lift value of a
rule as its y-coordinate and the recall as its x-coordinate (the support is
represented by the circle area). Of course, any other selection of measures is
possible. However, we found the named measures intuitive and will use them in every
figure of this work.

To present the temporal evolution of a rule set (w.r.t. the evaluation of selected
measures), in a real analysis environment an animation is shown that displays the
current state of the rule set at any given time (frame). For intermediary steps for
which no data is available, we simply use a linear interpolation. This is not a
limitation as long as the time windows are not too wide and it can be assured that
there are no quick changes in the rule behavior between two time points.

To accommodate the print medium, we will use an alternative visualization of
the temporal changes in the remainder of this paper. Consecutive temporal steps are
connected by a straight line (indicating the movement as it would happen in the
animation). Additionally, the last location of a rule is marked by a small dot, which
serves as an arrow head (which would be too small to be recognized as such). An
example trajectory of a single rule can be seen in Fig. 1.



3.2 Linguistic Concepts 

Filtering the rule set for predefined temporal patterns first and foremost calls for
a language or framework in which to define the desired behavior. We decided to
use an approach based on fuzzy rules to describe the temporal properties of the
evolution of rule measures. The goal is to enable the user to specify linguistically
what kind of change of the rule evaluation measures he is interested in. For example,
the user may be interested in rules that exhibit a fast increase of the lift as well
as a moderate increase of confidence. When using the fuzzy approach described below,
the user can specify individually what "moderate" or "fast" means by defining a
fuzzy partition over the domains of the rule evaluation measures of interest. Since
we will be able to compute for every rule of the rule set the membership degree to
which the respective rule evolution belongs to the user-specified concept, we
can use a threshold to limit the set of resulting rules that are shown to the user or
order all rules by descending membership degree.

The basic idea is to allow the user to define a fuzzy rule antecedent that contains
linguistic variable assignments over the domains of rule evaluation measures (or the
domains of their change rate). Multiple such assignments are combined with
well-known fuzzy connectives such as t-norms or t-conorms. That is, the membership
degree of any rule $a \to c$ to the example fuzzy description stated above in the
text is

$$\mu(a \to c) = \top\bigl(\mu_{\Delta\mathrm{lift}}^{(\mathrm{fast})}(a \to c),\ \mu_{\Delta\mathrm{conf}}^{(\mathrm{moderate})}(a \to c)\bigr),$$

where $\top$ is a t-norm that represents a fuzzy conjunction. In this article we
always use $\top_{\min}$. Since we intend to assign a membership degree to the change
of any rule evaluation measure (represented by the $\Delta$ in the linguistic variable
name), we need to quantify this change rate from the data set. In this work, we use
the straightforward approach of taking the slope of a regression line: for any rule,
the values of the desired rule evaluation measures are calculated for every time
frame that the data set contains. This yields a series of points that have to be used
to derive a quantitative value of change. A simple but quite robust way is the
mentioned regression line that is fitted to the point set. The slope of this line
serves as an indicator of the overall linear trend of the rule measure.

The last prerequisite that is needed is the fuzzy partition of the domain of Aevai 
where eval represents any rule evaluation measure. In this work it is up to the user 
to specify how to partition this domain. The rationale is as follows: whenever the 
user changes a fuzzy set of the fuzzy partition, the rule set visualization that matches 
the concept being edited is updated instantly. Therefore, there is an immediate feed- 
back and the user is able to determine visually, whether the currently edited fuzzy 
set (e.g., for the linguistic value “unchanged”) really meets his intentions (e.g., by 
conceiving that the resulting rules do not change their vertical position). 

With these ingredients at hand, the rule analysis can be summarized as follows: 

1 . Specify a set of linguistic descriptions (fuzzy rule antecedents) that refer to the 
temporal changes of rule evaluation measures. 

2. Provide fuzzy partitions of the domains of the change rate of the measures from 
step 1. 

3. Evaluate for every rule the membership degrees for the linguistic concepts from 
step 1. 

4. For every linguistic concept, order the rules according to their membership 
degrees such that for a given threshold the set of rules can easily be determined 
whose membership degrees exceed this threshold. 



(Aiift is fast A Aconf is moderate) 



will be evaluated as 




(moderate) 

^conf 
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4 Experiments 

We will present twofold experimental evidence that the proposed approach can help 
to reduce the size of rule sets and guide the user by grouping the rules according 
to their temporal behavior. First, we apply the method to several artificial data sets 
into which special temporal changes have been incorporated manually. After that, we 
will apply the proposed method to a real-world data set and point out an interesting 
discovery. Since this latter data set stems from an industrial partner, we cannot 
give details such as attribute names and values. However, the charts will reveal rules 
that have been judged meaningful by experts. This partner is an automobile manufacturer. 
The configuration of every car that leaves the production plant is stored 
in a database. Whenever some problem arises and is fixed by an authorized repair 
shop, the respective database entry is updated. The goal is to track down problems 
before they affect a large number of cars and to verify whether counteractive 
measures such as a recall have the desired effect. 



4.1 Artificial Data Set 

In this hand-crafted example, five attributes are stored for every car: Time (referring 
to the time the respective database entry was assessed), Country (to which the car is 
sold), Engine type, Air-condition type and the Class variable. There are five different 
countries, five air-condition types and three engine types; the class variable indicates 
a failure by a Boolean value. The time has three discrete values mimicking three 
months in which the current state of the cars was assessed. We implemented the following 
peculiarity: one special type of air-condition fails more often in two countries. The 
average failure rate in the other countries is 15% (which was the case in a special 
real-world example we dealt with; note that this was a special pre-selection of cars, 
hence the high failure rate). The failure rates for the two designated countries 
grow to 20% and 40%, respectively. All given probabilities get added noise with a 
magnitude of ±4%, i.e., the failure rates in the “unaffected” countries may range 
from 11% to 19%. 
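
For a data set of this shape, the rule evaluation measures of a rule such as (Air-condition = A, Country = X → failure) can be computed separately for every month, which yields the series whose trend is assessed later. The Python sketch below shows one way to do this; the record layout with the keys `time`, `aircon`, `country` and `failure`, the tiny hand-made sample and the function names are assumptions for illustration and not the generator used for the experiments.

```python
def rule_measures(records, antecedent, consequent=("failure", True)):
    """Support, confidence and lift of antecedent -> consequent.

    records is a list of dicts. The sketch assumes that the antecedent
    covers at least one record and the consequent occurs at least once.
    """
    n = len(records)
    key, value = consequent
    covered = [r for r in records
               if all(r[k] == v for k, v in antecedent.items())]
    hits = sum(1 for r in covered if r[key] == value)
    base = sum(1 for r in records if r[key] == value)
    support = hits / n
    confidence = hits / len(covered)
    lift = confidence / (base / n)
    return support, confidence, lift


def measures_per_month(records, antecedent):
    """Evaluate the rule on every time frame, in chronological order."""
    months = sorted({r["time"] for r in records})
    return [rule_measures([r for r in records if r["time"] == m], antecedent)
            for m in months]


def car(time, aircon, country, failure):
    return {"time": time, "aircon": aircon, "country": country,
            "failure": failure}


records = [
    car(1, "A", "X", True), car(1, "A", "X", False),
    car(1, "B", "Y", True), car(1, "B", "Y", False),
    car(2, "A", "X", True), car(2, "A", "X", True),
    car(2, "B", "Y", False), car(2, "B", "Y", False),
]
rule = {"aircon": "A", "country": "X"}
series = measures_per_month(records, rule)
lift_series = [lift for _, _, lift in series]
print(series)
```

The resulting `lift_series` is the kind of point series to which a regression line is fitted in Sect. 3.2.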

The left of Fig. 2 shows 25 rules that indicate a failure in their consequent. The 
antecedents are made up of all combinations of values of the two attributes 
Air-condition and Country, i.e., for the sake of brevity we did not run a rule induction 
algorithm here. One could argue that this rather small rule set does not 
need any filtering or grouping. However, a rule set containing several hundred rules, 
as may well occur in a real example, could not be assessed manually. 

One rule is clearly eye-catching, namely the rule that exhibits a 
rather strong increase in failure rate. When applying the concept 

($\Delta_{\mathrm{lift}}$ is much increasing) 

with a threshold of 90%, the only remaining rule is exactly the rule discussed above. 
The right part of Fig. 2 shows the result. If we apply the concept 

($\Delta_{\mathrm{lift}}$ is increasing) 

to identify rules whose lift value grows only moderately, we arrive at a group of 
rules as shown on the left of Fig. 3. The rule designated by (1) represents the second 
rule that had been implemented into this fictitious data set. The other two rules that 
are also returned exhibit a similar lift behavior that can be attributed to noise. 

Fig. 2 Left: Rule set with 25 rules that describe fictitious failure dependencies. One peculiar 
rule behavior can be identified directly. Right: The rule with the very strong lift increase w.r.t. 
the underlying fuzzy partition of the lift change domain is returned after applying the concept 
($\Delta_{\mathrm{lift}}$ is much increasing) and returning only rules that have a membership degree of at least 90% 



4.2 Real-world Data Set 

After demonstrating the applicability of our proposed approach, we present further 
evidence in the form of a real-world data set from a cooperation with an automobile 
manufacturer. The data set under analysis contains approximately 300,000 tuples 
with 180 attributes. Since this data set was provided by an industrial partner, we are 
not allowed to disclose confidential information such as the meaning of the attribute 
values or the specific interpretation of the class variable. All we can tell is that 
every tuple in the data set represents a unique car that left the production plant of 
the vehicle manufacturer. Since the time of a failure was logged for every car as well, 
we were able to partition the full set of tuples into six data sets of (in this case) equal 
width. The rule set under consideration (which was induced via a hybrid rule induction 
method discussed, e.g., in Steinbrecher, Rügheimer, & Kruse, 2008) contained 
95 rules. The right chart of Fig. 3 depicts the result after matching the rule set against 
the concept 

($\Delta_{\mathrm{supp}}$ is increasing). 

The threshold was set to 65% to shrink the number of results. Rule (2) intuitively 
shows the support increase by the growing size of the circle. The trajectories of the 
other depicted rules are hard to distinguish but exhibit the same support-increasing 
behavior when using the animation visualization instead of the depicted 
global trajectories. 

Fig. 3 Left: The artificial data set was filtered against the concept ($\Delta_{\mathrm{lift}}$ is increasing) with a 
minimum membership degree of 20%. Among the returned rules, the one with the strongest 
lift increase represents the second intentionally included rule. Right: Filtering a rule set of 95 rules. 
The chart shows rules that match the concept ($\Delta_{\mathrm{supp}}$ is increasing) at least to a degree of 65%. The 
trajectory marked with (2) shows a good example of the obvious support increase since the circle 
area represents the support 



5 Conclusion and Future Work 

This work presented an intuitive approach to the topic of association rule post- 
processing. We focused on data sets with a temporal variable because experience 
teaches that association rules as descriptions of suspicious patterns never arise 
immediately or vanish abruptly. 

Fuzzy concepts were chosen to be the framework for the linguistic specification 
of the rule evolution behavior of interest. The user specifies fuzzy descriptions that 
are defined over the domains of interest, here the domain of change rates of selected 
rule evaluation measures. By changing the fuzzy partitions over these domains the 
user is able to relax or tighten the strictness of his concepts. Since rules can have 
a membership degree between zero and one to every concept, the resulting rule set 
(rules that have a positive membership degree) can be ordered to focus on rules that 
best match the user concept. 

The presented examples - fictitious and real - in Sect. 4 gave rise to a positive 
view: the presented framework was able to single out rules that could be intuitively 
considered to match the linguistic concepts. 

However, due to the way the change rate of the rule measures is assessed, it is 
easy to see that some temporal patterns cannot be detected, namely all those that 
exhibit a cyclic or alternating evolution, since the slope of the regression line will 
then be close to zero. Therefore, the set of assessment measures for rule behavior has 
to be extended. 

Another interesting aspect that is currently investigated is the generation of the 
fuzzy partition that underlies all linguistic variables. Up to now, they were chosen 
beforehand to yield an intuitive result of what is considered “unchanged” or “much 
increasing”. One way of involving the user is to allow him to modify the fuzzy 
sets via a visual tool and change the matching rules instantly. This would result 
in a parameter-free process where the user can immediately see whether the current 
fuzzy partition with the respective fuzzy sets meets his intentions about the resulting 
rule set. 



Clustering with Repulsive Prototypes 



Roland Winkler, Frank Rehm, and Rudolf Kruse 



Abstract Although there is no exact definition for the term cluster, in the 2D case 
it is fairly easy for human beings to decide which objects belong together. For 
machines, on the other hand, it is hard to determine which objects form a cluster. 
The success of a clustering algorithm on a given problem depends on its creators' idea 
of what a cluster should be; each clustering algorithm thus embodies a characteristic 
notion of the term cluster. For example, the fuzzy c-means 
algorithm (Kruse et al., Advances in Fuzzy Clustering and Its Applications, Wiley, 
New York, 2007, pp. 3-30; Höppner et al., Fuzzy Clustering, Wiley, Chichester, 
1999) tends to find spherical clusters with equal numbers of objects. Noise clustering 
(Rehm et al., Soft Computing - A Fusion of Foundations, Methodologies and 
Applications 11(5):489-494) focuses on finding spherical clusters of user-defined 
diameter. 

In this paper, we present an extension to noise clustering that tries to maximize 
the distances between prototypes. For that purpose, the prototypes behave like repulsive 
magnets that have an inertia depending on their sum of membership values. 
Using this repulsive extension, it is possible to prevent groups of objects from being 
divided into more than one cluster. We show that, due to the repulsion and inertia, 
it is possible to determine the number and approximate position of clusters in a 
data set. 

Keywords Air traffic management • Fuzzy c-means • Noise clustering • Repulsive 
prototypes. 



1 Introduction 

Prototype-based clustering algorithms have one thing in common: they require 
knowledge about the expected number of data clusters in advance. Even if this information 
is at hand, the initialization of the prototypes still has a strong influence on the 
quality of the clustering result. So far, the problem of finding the correct number of 
clusters and a good initialization for the prototypes cannot be solved analytically; 
hence, experts or heuristics are needed. 

In this paper, we present a method that makes use of available information, such 
as the expected size and separation of clusters in a data set, to gain knowledge about 
the number of clusters and their approximate position. This is done by using repulsive 
prototypes. The repulsive force prevents prototypes from coming close to each 
other, which leads to well-separated prototypes. The result can be used to initialize 
(non-repulsive) clustering algorithms. 

This paper is structured in four parts. The next part contains a brief description 
of fuzzy c-means and noise clustering, which will be used to introduce repulsive 
prototypes later (Bezdek, 1981; Dave & Krishnapuram, 1997). In Sect. 3, we will 
introduce the mathematical background and the usage of repulsive prototypes. In 
Sect. 4 we will present results on a practical application. Finally, we conclude with 
Sect. 5. 



2 Fuzzy c-Means and Noise Clustering 

Repulsive prototypes extend the concept of fuzzy c-means (FCM) and its derivatives, 
e.g., noise clustering (NC). Although both algorithms are very well known, some 
mathematical details are needed in the next section, which makes it necessary to 
repeat them at this point. Let $X \subset V$ be a finite set of data objects of a vector 
space $V$ with $|X| = n$. The clusters are represented by a set of prototypes 
$B = \{\beta_1, \ldots, \beta_m\} \subset V$, which can be initialized randomly. Only the number of 
prototypes $m$ must be known in advance. Let $1 < \omega \in \mathbb{R}$ be the fuzzifier and 
$U \in \mathbb{R}^{m \times n}$ be the partition matrix with $u_{ij} \in [0, 1]$ and $\forall j: \sum_{i=1}^{m} u_{ij} = 1$. 
And finally, let $d : V \times V \to \mathbb{R}$ be a distance function with its abbreviation 
$d_{ij} = d(\beta_i, x_j)$. 

Fuzzy c-means is defined by an objective function $J$ that is to be minimized: 

\[ J = \sum_{i=1}^{m} \sum_{j=1}^{n} u_{ij}^{\omega}\, d_{ij}^{2}. \qquad (1) \]

The minimization of $J$ is done by iteratively updating the members of $U$ and 
$B$ and is computed using a Lagrange extension to hold the side constraint 
$\sum_{i=1}^{m} u_{ij} = 1$. The iteration steps are denoted by a time variable $t \in \mathbb{N}$, with 
$t = 0$ as the initialization step: 

\[ u_{ij}^{t+1} = \frac{1}{\sum_{k=1}^{m} \left( d_{ij}^{t} / d_{kj}^{t} \right)^{2/(\omega-1)}}, \qquad \beta_i^{t+1} = \frac{\sum_{j=1}^{n} \left( u_{ij}^{t+1} \right)^{\omega} x_j}{\sum_{j=1}^{n} \left( u_{ij}^{t+1} \right)^{\omega}}. \qquad (2) \]

For noise clustering, an additional cluster is specified which is represented by a 
virtual prototype $\beta_0$ that has no location in $V$. Instead, it has a constant distance 
$0 < v \in \mathbb{R}$ to all data objects, $\forall j: d_{0j} = d(\beta_0, x_j) = v$, which is called the noise 
distance. $\beta_0$ is not represented as a member of $V$, and so it is not updated during the 
iteration process. The noise prototype is introduced to assign higher membership 
degrees to the noise cluster for all data objects whose distance to the regular prototypes 
exceeds the noise distance. This helps the regular prototypes to be placed in the 
center of data clusters without being heavily attracted by noise data. 
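
The alternating updates of fuzzy c-means with an additional noise cluster can be sketched in a few lines of Python. The sketch assumes the Euclidean distance, runs for a fixed number of steps instead of testing for convergence, and uses made-up function names and a made-up toy data set; it is meant as an illustration of the scheme, not as the implementation used for the experiments.

```python
import math


def memberships(points, prototypes, omega=2.0, noise_dist=None):
    """Membership u[i][j] of point j to regular prototype i.

    If noise_dist is given, a virtual noise prototype at that constant
    distance takes part in the normalisation; its own memberships are
    not returned.
    """
    exponent = 2.0 / (omega - 1.0)
    u = [[0.0] * len(points) for _ in prototypes]
    for j, x in enumerate(points):
        # Distances d_ij, floored to avoid division by zero.
        dists = [max(math.dist(p, x), 1e-12) for p in prototypes]
        all_dists = dists + ([noise_dist] if noise_dist is not None else [])
        for i, d_ij in enumerate(dists):
            u[i][j] = 1.0 / sum((d_ij / d_kj) ** exponent
                                for d_kj in all_dists)
    return u


def update_prototypes(points, u, omega=2.0):
    """Weighted means of the data objects with weights u_ij ** omega."""
    dim = len(points[0])
    new = []
    for row in u:
        weights = [w ** omega for w in row]
        total = sum(weights)
        new.append(tuple(
            sum(w * x[a] for w, x in zip(weights, points)) / total
            for a in range(dim)))
    return new


def noise_clustering(points, prototypes, noise_dist, omega=2.0, steps=30):
    """Alternate membership and prototype updates for a fixed number of steps."""
    for _ in range(steps):
        u = memberships(points, prototypes, omega, noise_dist)
        prototypes = update_prototypes(points, u, omega)
    return prototypes, memberships(points, prototypes, omega, noise_dist)


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
          (5.0, -5.0)]  # the last point plays the role of noise
start = [(0.3, 0.2), (0.8, 0.9)]
final, u = noise_clustering(points, start, noise_dist=0.5)
print(final)
```

Because the far-away point receives almost all of its membership from the noise cluster, it has hardly any influence on the two regular prototypes.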



3 Repulsive Prototypes 



Unfortunately, noise clustering as described above has certain disadvantages when 
it comes to the separation of clusters. It is quite likely that two prototypes end up in the 
same data cluster, leaving one or more clusters without any prototype. To prevent 
this, a penalty term can be added to the objective function to push prototypes further 
away from each other. This works fine under certain circumstances, but it offers 
only an indirect influence on the repulsion behavior of the prototypes. An alternative 
procedure is to change the update function of the prototypes directly. In this way, the 
prototypes’ behavior can be controlled easily, albeit at the expense that 
the algorithm can no longer be based on an objective function. 

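The repulsion among the prototypes is calculated pairwise for each pair of prototypes. 
The strength of the repulsion depends on the distance between two prototypes 
and on the sum of membership values to the respective prototypes. The modified 
update function for repulsive prototypes is defined as 

\[ \beta_i^{t+1} = \underbrace{\frac{\sum_{j=1}^{n} \left( u_{ij}^{t+1} \right)^{\omega} x_j}{\sum_{j=1}^{n} \left( u_{ij}^{t+1} \right)^{\omega}}}_{A} \;+\; c \sum_{\substack{k=1 \\ k \neq i}}^{m} \Biggl( \underbrace{\frac{\beta_i^{t} - \beta_k^{t}}{\lVert \beta_i^{t} - \beta_k^{t} \rVert}}_{B} \cdot \underbrace{\frac{u_k}{u_i + u_k}}_{C} \cdot \underbrace{\varphi\bigl( \lVert \beta_i^{t} - \beta_k^{t} \rVert \bigr)}_{D} \Biggr). \qquad (3) \]
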
The first term ($A$) is identical to (2) and describes the impact of the data objects on 
a prototype, while the rest of the formula describes the repulsion. Because the repulsion 
is computed for each pair of prototypes, the repulsion is calculated between 
prototype $i$ and every other prototype. The term ($B$) is a unit vector for the 
direction of the repulsion. The term ($C$) takes the influence of the data objects on 
the respective prototype into account, with $u_i = \sum_{j=1}^{n} u_{ij}$. Imagine there are two 
prototypes in one data cluster. If they pushed each other away with equal force, 
they would both be pushed out of the cluster. This is not the desired result, since one 
of them should remain inside. In crisp words: the term ($C$) gives more force to the 
prototype to which more data objects are assigned. In fuzzy terms, the prototype with 
the larger sum of membership values is preferred in the pairwise repulsion process. 
The last term ($D$) takes the distance between prototypes into account. The function 
$\varphi : \mathbb{R} \to [0, 1]$ should be monotonically decreasing and continuous. In principle, 
every function that satisfies these constraints is valid, but here we will consider one 
family of functions in particular, which is described in the next paragraph. Note 
that the value inside the sum is between 0 and 1. If the data objects are not scaled 
to a unified space, the influence of the repulsion might be too weak to counteract 
the attraction of the data objects. The parameter $c$ handles the balance between the 
attraction of the data objects and the repulsion among prototypes. If the data set is 
standardized, $c$ can be set to 1. 
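
To make the interplay of the terms (B), (C) and (D) concrete, the following Python sketch computes the repulsive displacement acting on one prototype. The weight `u_k / (u_i + u_k)` used for term (C) is an assumption that is consistent with the description above (the prototype with the larger membership sum is pushed less); the function name, the constant placeholder for the repulsion function and the sample values are likewise made up for the example.

```python
import math


def repulsion(i, prototypes, u_sums, phi, c=1.0):
    """Repulsive displacement acting on prototype i.

    prototypes : list of coordinate tuples
    u_sums     : u_sums[k] is the sum of memberships to prototype k
    phi        : decreasing function mapping a distance to [0, 1]
    """
    dim = len(prototypes[i])
    push = [0.0] * dim
    for k, other in enumerate(prototypes):
        if k == i:
            continue
        dist = math.dist(prototypes[i], other)
        if dist == 0.0:
            continue  # direction undefined for coinciding prototypes
        weight = u_sums[k] / (u_sums[i] + u_sums[k])  # term (C), assumed form
        strength = weight * phi(dist)                 # (C) times (D)
        for a in range(dim):
            # term (B): unit vector pointing away from prototype k
            push[a] += strength * (prototypes[i][a] - other[a]) / dist
    return tuple(c * p for p in push)


def constant_half(distance):
    """Placeholder repulsion function, constant for this demonstration."""
    return 0.5


# Two prototypes in the same cluster; the first has attracted more data.
protos = [(0.0, 0.0), (1.0, 0.0)]
sums = [3.0, 1.0]
push_0 = repulsion(0, protos, sums, constant_half)
push_1 = repulsion(1, protos, sums, constant_half)
print(push_0, push_1)
```

In this example the prototype with the smaller membership sum receives the stronger push, so it is the one that tends to leave the cluster.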

Practical tests have shown that a function of the family $\varphi(x) = [\ldots]$ is not 
suitable for the repulsion process. Instead, the logistic function turned out to be 
well suited. This is why we decided to use the following variation of the logistic 
function: 

\[ \varphi(x) = \frac{1}{1 + e^{a (x - \sigma)}}. \]

The value $\sigma$ is the distance at which the function $\varphi$ has the value 0.5. The parameter 
$a$ describes the gradient of $\varphi$ at the point $\sigma$. The problem is that such a parameter 
$a$ is not very intuitive. Therefore, $a$ is determined from the two constraints 
$\varphi(\sigma) = 0.5$ and $\varphi(\gamma) = \alpha$ with $\gamma > \sigma$ and $\alpha \in (0, 0.5)$. In 
words: the repulsion should have half its strength at a distance of $\sigma$ and should have 
almost no effect at a distance of $\gamma$. Mathematically, “almost no effect” is described 
by $\alpha$, so that a fixed value of $\alpha = 0.05$ might be useful. With these constraints, the 
parameter $a$ can be computed by 

\[ a = \frac{\ln\left( \frac{1}{\alpha} - 1 \right)}{\gamma - \sigma}. \]

Using the parameters $\sigma$ and $\gamma$ (leaving $\alpha$ at a fixed value), it is easy and intuitive to 
control the repulsion. In Fig. 1, $\varphi$ is plotted for several sets of parameters. When applying 
noise clustering with repulsive prototypes, it has proven useful to set $\sigma = 0.9v$, 
$\gamma = 1.8v$ and $\alpha = 0.05$ to obtain well-separated clusters. Depending on the inherent 
data structure, appropriate values should be assigned to these parameters. $v$ can be 
interpreted as the maximal spatial extension of a cluster, while $\sigma$ and $\gamma$ influence the 
minimal distance between the centers of clusters. 
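
The parametrization of the repulsion function can be checked numerically with a short Python sketch. It builds the logistic function from the half-strength distance, the distance of negligible effect and the residual strength, computing the slope parameter from the two constraints. The argument names `sigma`, `gamma` and `alpha` and the function name `make_phi` are labels chosen for this example.

```python
import math


def make_phi(sigma, gamma, alpha=0.05):
    """Logistic repulsion with phi(sigma) = 0.5 and phi(gamma) = alpha."""
    if not (gamma > sigma and 0.0 < alpha < 0.5):
        raise ValueError("need gamma > sigma and 0 < alpha < 0.5")
    a = math.log(1.0 / alpha - 1.0) / (gamma - sigma)

    def phi(x):
        z = a * (x - sigma)
        if z > 700.0:  # exp would overflow; the repulsion is zero anyway
            return 0.0
        return 1.0 / (1.0 + math.exp(z))

    return phi


noise_distance = 0.2  # the noise distance used in the experiments
phi = make_phi(sigma=0.9 * noise_distance, gamma=1.8 * noise_distance)
for x in (0.0, 0.9 * noise_distance, 1.8 * noise_distance, 0.6):
    print(f"phi({x:.2f}) = {phi(x):.4f}")
```

Only the two distances need to be chosen by the user; the slope follows from them.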

Fig. 1 Repulsion function with several parameters for $\gamma$ 

Tests have shown that the impact of the attracting conventional FCM component 
and the repulsive component in the update equation needs to be balanced appropriately. 
Otherwise, due to the FCM component, prototypes will be attracted to the data 
clusters in one iteration, followed by a strong repulsion in the next. The algorithm’s 
behavior can best be described as “nervous”. To prevent this, it is necessary to define 
a relatively small learning rate $\delta \in (0, 1]$, such as $\delta = 0.1$. Then, the update formula 
of the prototypes is expanded by this learning rate. 
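
One common way to apply such a learning rate is to move a prototype only the fraction δ of the way from its old position to the position prescribed by the update function. The Python sketch below shows this convex-combination form; it is an assumption made for illustration, and the name `damped_update` is made up.

```python
def damped_update(old, target, delta=0.1):
    """Move a prototype from `old` towards `target` by the fraction delta."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    return tuple(o + delta * (t - o) for o, t in zip(old, target))


# The prototype covers only a tenth of the distance in one iteration.
print(damped_update((0.0, 0.0), (1.0, 2.0), delta=0.1))
```

With `delta = 1` the undamped update is recovered, while small values smooth out the alternation between attraction and repulsion.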

Sometimes, the final setup of the prototypes is not useful for generating a comprehensible 
partition of the data objects, because the repulsion influences the position 
of the prototypes so that they can no longer be seen as representatives of the clusters. 
For this reason, we do not consider clustering with repulsive prototypes 
as an alternative to fuzzy c-means or other prototype-based clustering 
algorithms (Gath & Geva, 1989; Gustafson & Kessel, 1979). But the repulsion has 
proven very useful for solving the initially mentioned problem of finding the number 
and approximate position of the clusters. 

Noise clustering extended with repulsive prototypes is still a prototype-based 
algorithm that needs the number of prototypes to be known in advance. Since the 
algorithm should only be used for initialization purposes, it is not necessary to have 
exactly the same number of prototypes as there are clusters in a data set. It is 
therefore possible to overestimate the number of clusters. In fact, it is 
useful to use two or three times as many prototypes 
as there are expected clusters in the data set. This way, it is almost guaranteed 
that each cluster is found by at least one prototype. Due to the repulsive behavior 
of the prototypes, and assuming the parameters are chosen correctly, each cluster 
will hold only one prototype. Accordingly, many prototypes float outside of 
clusters and might stabilize on some noise data objects. Due to the term ($C$) in (3), a 
prototype floating outside of a cluster is not able to chase away a prototype already 
inside this cluster. Even if two prototypes are initialized inside one cluster, 
a small imbalance is sufficient for one of the prototypes to push out the other. 

After running noise clustering with repulsive prototypes and an overestimated 
number of clusters, a simple test $T : B \to \{1, 0\}$ can be used to determine whether a 
prototype is considered to be inside a cluster or not. This test can depend on the 
specific clustering task. A very simple test would be to require a minimal sum of 
membership values of all data objects towards one prototype: 

\[ T(\beta_i) = \begin{cases} 1, & \text{if } \sum_{j=1}^{n} u_{ij} \text{ reaches the required minimal sum}, \\ 0, & \text{otherwise.} \end{cases} \]

Finally, the positions of all positively tested prototypes can be used to initialize 
another prototype-based clustering algorithm such as fuzzy c-means. 

Experiments have shown that if there is little or even no noise, prototypes that 
are outside of data clusters do not stop moving. The reason is that they are strongly 
influenced even by small changes of other prototypes. Therefore, the usual approach 
to terminating the algorithm, i.e., stopping when the difference in the membership matrix 
from one iteration step to the next is small ($\lVert U^{t-1} - U^{t} \rVert < \varepsilon$), is not applicable. 
A simple solution to this problem is to terminate the algorithm after a previously 
defined number of iterations. An alternative could be to measure the difference of 
prototype positions for prototypes that are detected to be inside a cluster. 
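
The membership-sum test and the subsequent selection of prototypes for initialization can be sketched as follows in Python. The threshold name `min_sum`, the function names and the small membership matrix are assumptions for illustration; a suitable threshold depends on the clustering task.

```python
def inside_cluster(u, min_sum):
    """Return T(beta_i) for every prototype: 1 if its membership sum reaches min_sum."""
    return [1 if sum(row) >= min_sum else 0 for row in u]


def initial_prototypes(prototypes, u, min_sum):
    """Keep only the positively tested prototypes for a later FCM run."""
    flags = inside_cluster(u, min_sum)
    return [p for p, keep in zip(prototypes, flags) if keep]


# Three prototypes, three data objects; the second prototype floats outside.
u = [[0.90, 0.80, 0.10],
     [0.05, 0.10, 0.02],
     [0.00, 0.05, 0.80]]
prototypes = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
print(inside_cluster(u, min_sum=0.5))
print(initial_prototypes(prototypes, u, min_sum=0.5))
```

The number of prototypes that pass the test serves as the estimate of the number of clusters.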



4 Experimental Results 

Repulsive prototypes are of great value if there are many data sets to analyze 
that have similar properties regarding the expected cluster size but different numbers 
of clusters. This is the case for an actual problem in the domain of air traffic management. 
In this application, the airspace around airports needs to be analyzed. Groups 
of aircraft need to be found that approach the airport from similar directions. For 
this purpose, the first radar point of each aircraft inside the specified airspace is considered. 
When applying fuzzy c-means or noise clustering with randomly initialized 
prototypes, it is very unlikely that each cluster is found by exactly one prototype. 
The data set presented in Fig. 2 (left) is similar to one of our examples but was 
generated artificially for illustration purposes. An approach using noise clustering with 
randomly initialized prototypes often ends in a result like the one in Fig. 2 (right). One 
cluster is associated with two prototypes, which results in at least one data cluster being 
wrongly assigned to the noise cluster (illustrated by the circle on the left side). 

As shown in Fig. 3 (left), repulsive prototypes can be used to find the number and 
position of the clusters. In a second step, noise clustering can be used to partition 
this data set, which produces the result shown in Fig. 3 (right). For this example, the 
following parameters were used: $\omega = 2$, $v = 0.2$, $m = 20$, $\sigma = 0.9v$, $\gamma = 1.8v$, 
$\alpha = 0.05$ and [...] = 50. 

When it comes to the number of prototypes, the question may arise to what 
extent the number of prototypes may be overestimated. A test with artificial data 
has shown that there is almost no restriction on the number. Because the computational 
cost of the pairwise repulsion grows quadratically with the number of prototypes, 
however, it is not wise to overestimate the number of clusters unreasonably. 

Fig. 2 Example for clustering the entrance positions of the airspace surrounding an airport. The 
airport is located at the gray area in the middle. The big circle has a diameter of 200 NM (370 km). 
The data recordings in the middle are considered to be noise because they are too far away from 
the border. The “tails” of the prototypes represent their path from their initialization point to their 
convergence point 




As for all clustering algorithms, there are examples where repulsive prototypes 
do not work. An example is shown in Fig. 4. If the data set contains at least one large, 
extended cluster and several small ones in close proximity, so that the distance between 
the small clusters is less than the diameter of the large one, then repulsive clustering 
does not find a useful result. In our problem of clustering flight data, repulsive 
clustering worked very well for all examples. The correct number of clusters was 
found in all cases; only the noise distance had to be adjusted manually in some data 
sets due to unusually sized clusters. 

Fig. 4 Example where repulsive clustering does not produce a satisfying result. If the repulsion 
is strong enough so that the large clusters are represented by just one prototype, then the small 
clusters cannot be represented by one prototype each (left). If the repulsion is adjusted in such a way 
that the small clusters are correctly approximated, the large clouds are divided into several clusters 
(right) 



5 Conclusions and Future Work 

We have shown that noise clustering with repulsive prototypes can be used to find 
the number and approximate position of clusters in a data set. This can be useful if 
the exact number of clusters is not known in advance or if the clusters are located in 
such a way that a straightforward approach with fuzzy c-means does not produce good 
results due to bad initialization of the prototypes. We have also shown that repulsive 
prototypes can be parametrized intuitively, allowing their application without expert 
knowledge. 

In principle, repulsive prototypes can be used with every prototype-based 
clustering algorithm. Therefore, we will test the behavior of repulsive prototypes 
with other prototype-based clustering algorithms. We have not yet tested the 
behavior in high-dimensional spaces, which will be done in the near future. The problem 
of applying repulsive prototypes to problematic data sets like the one shown in the last 
section might be solved by localized distance measures. 



References 



Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: 
Plenum. 

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: a unified view. IEEE 
Transactions on Fuzzy Systems, 5, 270-293. 

Gath, I., & Geva, A. B. (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on 
Pattern Analysis and Machine Intelligence, 11, 773-781. 

Gustafson, D. E., & Kessel, W. C. (1979). Fuzzy clustering with a fuzzy covariance matrix. In 
Proceedings of the IEEE Conference on Decision and Control, San Diego, 761-766. 



Part III 
Mixture Analysis 



Weakly Homoscedastic Constraints 
for Mixtures of t-Distributions 



Francesca Greselin and Salvatore Ingrassia 



Abstract In this paper we introduce the concept of weak homoscedasticity for 
covariance matrices of the component densities, in the framework of constrained 
formulations of maximum likelihood estimation for mixture models. Further, we 
give a test for assessing weak homoscedasticity in two-sample data. Based on this 
approach, we show how to implement a constrained EM algorithm for mixtures of 
t-distributions. The proposal is illustrated by numerical experiments 
which show its usefulness in data modeling and classification. 

Keywords EM algorithm • Mixture models • t-Distributions • Weak homoscedasticity. 



1 Introduction 

Although most classical multivariate analysis has been concerned with the multivariate 
normal distribution, an increasing amount of attention has been given to 
alternative distributional models. One area of applicability of such models is in the 
study of robustness of multivariate techniques when departing from normality in the 
underlying distributions. The difficulties associated with many alternatives are both 
theoretical and practical. There is, however, a simple class of distributions having 
similar features to the multivariate normal but which exhibit either longer or shorter 
tails than the normal: the family of t-distributions, see, e.g., Kotz and Nadarajah 
(2004). Like the normal, t-distributions belong to the general family of elliptically 
symmetric distributions, but with an additional parameter (called the degrees 
of freedom) which acts as an adaptive robustness factor, tuning the heaviness of the 
tails. 

In the framework of mixture modeling based on normal and t multivariate distributions, 
parameter estimation is often performed according to the likelihood 
approach. However, it is well known that the likelihood function may be affected 
by both singularities and local maxima, causing the failure of optimization procedures 
like the EM algorithm. In order to address these drawbacks, a constrained 
formulation of maximum likelihood estimation for mixture 
models was proposed in Hathaway (1986). Furthermore, based on such results, constrained EM 
algorithms have been proposed for both multivariate normal and t-distributions (see 
Ingrassia, 2004; Ingrassia & Rocci, 2007; and Greselin & Ingrassia, 2009). In practice, 
such approaches amount to imposing suitable constraints on the eigenvalues of 
the covariance matrix of each component density. 

Francesca Greselin (✉) 
Dipartimento di Metodi Quantitativi per le Scienze Economiche e Aziendali, Università di Milano 
Bicocca, Milan, Italy, e-mail: francesca.greselin@unimib.it 

A. Fink et al. (eds.), Advances in Data Analysis, Data Handling and Business 
Intelligence, Studies in Classification, Data Analysis, and Knowledge Organization, 
DOI 10.1007/978-3-642-01044-6_20, © Springer-Verlag Berlin Heidelberg 2010 

In this paper we propose a stronger constraint than those considered before. 
Based on a statistical analysis developed on a real dataset, we introduce here the 
definition of weak homoscedasticity, an intermediate case lying between the two 
extreme cases of homoscedasticity and heteroscedasticity. Weak homoscedasticity 
corresponds to covariance matrices having the same ordered set of eigenvalues. Further, 
a test for detecting weak homoscedasticity in two-sample data is provided, 
under the multivariate normal assumption; finally we apply these ideas to a problem 
of data modeling using mixtures of t-distributions. 

The rest of the paper is organized as follows. In Sect. 2 basic concepts about mixtures
of multivariate t-distributions and parameter estimation are recalled, according
to the likelihood approach. In Sect. 3 the definition of weak homoscedasticity is
introduced and a statistical test for assessing weak homoscedasticity in two-sample
data is provided. In Sect. 4 the results of some numerical studies are reported. Finally
some concluding remarks are given in Sect. 5.



2 Preliminaries and Notation 

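A q-dimensional random vector $\mathbf{X}$ is said to have a multivariate t-distribution with $\nu$
degrees of freedom, location parameter $\boldsymbol{\mu}$ and positive definite inner product matrix
$\boldsymbol{\Sigma}$, if its joint probability density is given by

$$p(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma\left(\frac{\nu+q}{2}\right) |\boldsymbol{\Sigma}|^{-1/2}}{(\pi\nu)^{q/2}\, \Gamma\left(\frac{\nu}{2}\right) \{1 + \delta(\mathbf{x}, \boldsymbol{\mu}; \boldsymbol{\Sigma})/\nu\}^{(\nu+q)/2}}, \tag{1}$$

where

$$\delta(\mathbf{x}, \boldsymbol{\mu}; \boldsymbol{\Sigma}) = (\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \tag{2}$$

denotes the squared Mahalanobis distance between $\mathbf{x}$ and $\boldsymbol{\mu}$, with respect to the matrix $\boldsymbol{\Sigma}$.
In this case we write $\mathbf{X} \sim t_q(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu)$.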

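Suppose that $\mathbf{X} \mid \{U = u\} \sim N_q(\boldsymbol{\mu}, \boldsymbol{\Sigma}/u)$ for a scalar $U \sim \chi^2_{\nu}/\nu$, where $\nu$ is positive
and may be a non-integer. We then have the following properties:

1. $\mathbf{X} \sim t_q(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu)$.

2. $E(\mathbf{X}) = \boldsymbol{\mu}$ $(\nu > 1)$ and $\mathrm{Cov}(\mathbf{X}) = \nu\boldsymbol{\Sigma}/(\nu - 2)$ $(\nu > 2)$.

3. $U \mid \mathbf{x} \sim \chi^2_{\nu+q}/\{\nu + \delta(\mathbf{x}, \boldsymbol{\mu}; \boldsymbol{\Sigma})\}$.

For a complete review of mathematical properties and statistical methods based on
t-distributions see, e.g., Kotz and Nadarajah (2004). In the following we shall assume
$\nu > 2$.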
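As a minimal sketch of definition (1) and property 2, assuming NumPy is available, the log-density and the covariance of a multivariate t-distribution can be evaluated as follows. The function names `mvt_logpdf` and `mvt_cov` are illustrative choices and do not come from the paper.

```python
import math
import numpy as np

def mvt_logpdf(x, mu, sigma, nu):
    """Log of the density (1) of t_q(mu, sigma, nu) at the point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    q = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance delta(x, mu; sigma) of (2).
    delta = float(diff @ np.linalg.solve(sigma, diff))
    _, logdet = np.linalg.slogdet(sigma)
    return (math.lgamma((nu + q) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * q * math.log(math.pi * nu) - 0.5 * logdet
            - 0.5 * (nu + q) * math.log1p(delta / nu))

def mvt_cov(sigma, nu):
    """Covariance nu * sigma / (nu - 2) of property 2, defined for nu > 2."""
    if nu <= 2:
        raise ValueError("the covariance requires nu > 2")
    return nu / (nu - 2.0) * np.asarray(sigma, dtype=float)
```

For q = 1 and ν = 1 the density reduces to the Cauchy density, which gives a quick sanity check of the normalizing constant.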

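Let us consider maximum likelihood estimation for the vector $\boldsymbol{\gamma}$ of the parameters
of a k-component mixture of multivariate t-distributions, given by

$$f(\mathbf{x}; \boldsymbol{\gamma}) = \sum_{j=1}^{k} \alpha_j\, p(\mathbf{x}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j, \nu_j), \tag{3}$$

where $\boldsymbol{\gamma} = (\alpha_1, \ldots, \alpha_k, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k, \nu_1, \ldots, \nu_k)$, and $\Gamma$ is the parameter
space

$$\Gamma = \{\boldsymbol{\gamma} \in \mathbb{R}^{k[2+q+(q^2+q)/2]} : \alpha_1 + \cdots + \alpha_k = 1,\ \alpha_j > 0,\ |\boldsymbol{\Sigma}_j| > 0,\ \nu_j > 2,\ \text{for } j = 1, \ldots, k\}. \tag{4}$$

Let $\mathcal{L}(\boldsymbol{\gamma})$ be the log-likelihood function of $\boldsymbol{\gamma}$, given a sample $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$
of n i.i.d. observations coming from a population with law (3). The estimation of
the parameter $\boldsymbol{\gamma}$ is usually performed using the EM algorithm. This generates a
sequence of estimates $\{\boldsymbol{\gamma}^{(m)}\}_m$, where $\boldsymbol{\gamma}^{(0)} \in \Gamma$ denotes the initial guess and
$\boldsymbol{\gamma}^{(m)} \in \Gamma$ for $m \in \mathbb{N}$, so that the sequence $\{\mathcal{L}(\boldsymbol{\gamma}^{(m)})\}_{m \in \mathbb{N}}$ is not decreasing.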

When a fitted component has a very small value of the determinant of the covariance
matrix, relatively large local maxima of the likelihood can occur. Such a
component corresponds to a cluster containing a few data points, either relatively
close together, or almost lying in a lower dimensional subspace. Thus the EM algorithm
may converge to such a spurious maximizer, or even to a singularity of $\mathcal{L}(\boldsymbol{\gamma})$
whenever the determinant of a covariance matrix is nearly null. In order to avoid
singularities and to reduce the number of spurious maximizers, a constrained
re-formulation of the problem was proposed in Hathaway (1986).

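In the univariate normal setting, such a constraint requires that
$\min_{h \ne j} \sigma_h/\sigma_j \ge c > 0$, while in the multivariate normal case the constraint concerns
the smallest eigenvalue of the product matrix $\boldsymbol{\Sigma}_h \boldsymbol{\Sigma}_j^{-1}$, that is

$$\min_{1 \le h \ne j \le k} \lambda\left(\boldsymbol{\Sigma}_h \boldsymbol{\Sigma}_j^{-1}\right) \ge c > 0, \tag{5}$$

where $\lambda(\boldsymbol{\Sigma}_h \boldsymbol{\Sigma}_j^{-1})$ is the generic eigenvalue of the product matrix $\boldsymbol{\Sigma}_h \boldsymbol{\Sigma}_j^{-1}$ (see
also Hennig (2004)). In particular, under homoscedasticity, i.e., $\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_k$,
condition (5) yields $\min_{1 \le h \ne j \le k} \lambda(\boldsymbol{\Sigma}_h \boldsymbol{\Sigma}_j^{-1}) = \min_{1 \le h \ne j \le k} \lambda(\mathbf{I}) = 1$, where $\mathbf{I}$ is
the $q \times q$ identity matrix. However, a closer analysis of Hathaway's results shows
that condition (5) can be extended to the more general framework of mixtures of
elliptical distributions (and in particular, to mixtures of t-distributions).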
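To make condition (5) concrete, the following sketch computes the smallest eigenvalue of the product matrices over all ordered pairs of components and compares it with a bound c. It assumes NumPy and a list of positive definite covariance matrices; `min_eigen_ratio` and `satisfies_constraint` are hypothetical helper names.

```python
import numpy as np

def min_eigen_ratio(covs):
    """Smallest eigenvalue of Sigma_h @ inv(Sigma_j) over all ordered pairs h != j."""
    smallest = np.inf
    for h, sigma_h in enumerate(covs):
        for j, sigma_j in enumerate(covs):
            if h == j:
                continue
            # Sigma_h Sigma_j^{-1} is similar to a symmetric positive definite
            # matrix, so its eigenvalues are real and positive.
            eig = np.linalg.eigvals(sigma_h @ np.linalg.inv(sigma_j))
            smallest = min(smallest, float(np.min(eig.real)))
    return smallest

def satisfies_constraint(covs, c):
    """Check condition (5) for a given lower bound c > 0."""
    return min_eigen_ratio(covs) >= c
```

Under homoscedasticity every product matrix is the identity, so the function returns 1, in agreement with the remark following (5).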

In this paper a new kind of constraint is imposed on the covariance matrices
of the component densities in (3). In particular, in Sect. 3 we shall introduce the
definition of weakly homoscedastic covariance matrices when we can assume that
the covariance matrices $\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k$ share the same ordered set of eigenvalues. This
proposal is motivated by a statistical analysis of the Crab data set, to which the
following subsection is devoted.



2.1 The Crab Data Set 

The Crab data set consists of measures over a sample of 100 rock crabs of the genus
Leptograpsus (available at http://www.stats.ox.ac.uk/pub/PRNN/). Each specimen
has q = 5 measurements: the width of the frontal lip (FL), the rear width (RW), the
length along the midline (CL) and the maximum width (CW) of the carapace, and the
body depth (BD) (in mm); the data are grouped into two classes by sex, see Fig. 1.
In the framework of t mixtures, this dataset has been used in Peel and McLachlan
(2000) and Lin, Lee, and Hsieh (2007), where a sample of 100 units (with $n_1 = 50$
males and $n_2 = 50$ females, ignoring the true classification) has been clustered.

Usually, based on the results of Hawkins' simultaneous test of multivariate
normality and equal covariance matrices (see Hawkins, 1981), the two group-conditional
distributions are assumed to be normal with common covariance matrix.



Fig. 1 Scatterplot matrix of the Crab data set
However, in Peel and McLachlan (2000) it is pointed out that this assumption has a
marked impact on the implied clustering of the data: indeed forcing a homoscedastic
model produces a larger misallocation rate than in the case in which no constraint
is imposed. In the cited paper, Peel and McLachlan fitted the Crab data set using
both normal and t mixtures with equal covariance matrices, concluding that the two
models lead almost to the same error rate (19% and 18% respectively). On the contrary,
if no constraint is imposed, the observed misallocation rate decreased to 11%.
Thus the assumption of homoscedasticity led to a much inferior clustering of the
data: this apparent contradiction raised our curiosity and suggested undertaking a
deeper evaluation of the adequate constraints.

We point out that homoscedasticity means that the ellipsoids of equal concentration
have both the same shape and the same principal axes. However, a graphical
inspection of the data, see Fig. 1, suggests that here the latter assumption may be
too strong. Our belief is that, in these cases, the constraints can be usefully relaxed
by requiring only covariance matrices with the same shape.



3 Weakly Homoscedastic Covariance Matrices 



Definition 1 Two covariance matrices $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ with eigenvalues
$\lambda_1^{(1)} \le \lambda_2^{(1)} \le \cdots \le \lambda_q^{(1)}$ and $\lambda_1^{(2)} \le \lambda_2^{(2)} \le \cdots \le \lambda_q^{(2)}$ respectively, are said to be weakly
homoscedastic if they have the same ordered set of eigenvalues, that is $\lambda_h^{(1)} = \lambda_h^{(2)}$
$(h = 1, \ldots, q)$.

Hence, weak homoscedasticity can be thought of as an intermediate case, lying
between heteroscedasticity and homoscedasticity. It corresponds to imposing equal
volume and shape on the covariance matrices, along the lines of Murtagh and Raftery
(1984). In the rest of this section, we consider the problem of testing whether two
covariance matrices $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are weakly homoscedastic.
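Definition 1 can be illustrated with a small numerical sketch, assuming NumPy: a covariance matrix and a rotated copy of it have the same ordered eigenvalues, so they are weakly homoscedastic without being equal. The helper name `weakly_homoscedastic` and the tolerance are illustrative assumptions.

```python
import numpy as np

def weakly_homoscedastic(sigma1, sigma2, tol=1e-8):
    """True when both symmetric matrices share the same ordered eigenvalues."""
    lam1 = np.linalg.eigvalsh(sigma1)  # eigenvalues in ascending order
    lam2 = np.linalg.eigvalsh(sigma2)
    return bool(np.allclose(lam1, lam2, atol=tol))

# Same volume and shape, different orientation: rotate diag(1, 4) by 30 degrees.
theta = np.pi / 6
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
sigma_a = np.diag([1.0, 4.0])
sigma_b = rot @ sigma_a @ rot.T
```

Here `sigma_a` and `sigma_b` differ as matrices, yet the check on their eigenvalues succeeds, which is the intermediate situation between homoscedasticity and heteroscedasticity described above.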

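Let us consider the spectral decomposition of $\boldsymbol{\Sigma}_j$ $(j = 1, 2)$, i.e., $\boldsymbol{\Sigma}_j = \boldsymbol{\Gamma}_j \boldsymbol{\Lambda}_j \boldsymbol{\Gamma}_j'$,
where $\boldsymbol{\Lambda}_j$ is the diagonal matrix of the eigenvalues of $\boldsymbol{\Sigma}_j$ (in non-decreasing
order), and $\boldsymbol{\Gamma}_j$ is an orthogonal matrix whose columns are the standardized
eigenvectors; the symbol ' denotes matrix transpose. Then the usual test for
homoscedasticity

$$H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 \quad \text{vs.} \quad H_1: \boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$$

can be weakened, testing only the equality between the ordered sets of the eigenvalues
of $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ respectively

$$H_0: \boldsymbol{\Lambda}_1 = \boldsymbol{\Lambda}_2 \quad \text{vs.} \quad H_1: \boldsymbol{\Lambda}_1 \ne \boldsymbol{\Lambda}_2. \tag{6}$$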

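Let $\mathbf{x}_1^{(1)}, \ldots, \mathbf{x}_{n_1}^{(1)}$ be a sample of size $n_1$ drawn from a multivariate normal distribution
with covariance matrix $\boldsymbol{\Sigma}_1$, let $\mathbf{X}_1$ denote the $n_1 \times q$ data matrix with rows
$\mathbf{x}_1^{(1)\prime}, \ldots, \mathbf{x}_{n_1}^{(1)\prime}$ and let $\mathbf{S}_1$ be the sample covariance matrix of $\mathbf{X}_1$. According to the
principal component transformation

$$\mathbf{y}_i^{(1)} = \mathbf{G}_1'\left(\mathbf{x}_i^{(1)} - \bar{\mathbf{x}}_1\right), \qquad i = 1, \ldots, n_1 \tag{7}$$

the data $\mathbf{y}_1^{(1)}, \ldots, \mathbf{y}_{n_1}^{(1)}$ are uncorrelated, their covariance matrix $\mathbf{L}_1$ is the diagonal
matrix of the eigenvalues of $\mathbf{S}_1$, $\mathbf{G}_1$ is an orthogonal matrix whose columns are the
corresponding standardized eigenvectors, and $\bar{\mathbf{x}}_1$ is the mean vector of $\mathbf{X}_1$. Analogous
notation is adopted for a second sample $\mathbf{x}_1^{(2)}, \ldots, \mathbf{x}_{n_2}^{(2)}$ of size $n_2$ drawn from a
multivariate normal distribution with covariance matrix $\boldsymbol{\Sigma}_2$.

Denoting by $\mathbf{Y}_1$ and $\mathbf{Y}_2$ the data matrices row-wise composed by $\mathbf{y}_1^{(1)}, \ldots, \mathbf{y}_{n_1}^{(1)}$
and $\mathbf{y}_1^{(2)}, \ldots, \mathbf{y}_{n_2}^{(2)}$ respectively, we get

$$\mathbf{Y}_1 = (\mathbf{X}_1 - \mathbf{1}\bar{\mathbf{x}}_1')\mathbf{G}_1 \quad \text{and} \quad \mathbf{Y}_2 = (\mathbf{X}_2 - \mathbf{1}\bar{\mathbf{x}}_2')\mathbf{G}_2,$$

where $\mathbf{1} = (1, 1, \ldots, 1)'$ is a vector of $n_1$ (respectively $n_2$) ones.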

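The test for weak homoscedasticity (6) can now be restated as follows

$$H_0: \left(\lambda_1^{(1)} = \lambda_1^{(2)}\right) \cap \left(\lambda_2^{(1)} = \lambda_2^{(2)}\right) \cap \cdots \cap \left(\lambda_q^{(1)} = \lambda_q^{(2)}\right)$$
$$H_1: \text{there exists } h \in \{1, \ldots, q\} \text{ such that } \lambda_h^{(1)} \ne \lambda_h^{(2)}. \tag{8}$$

Since uncorrelated Gaussian random variables are independent, the
test (8), under the assumption of multinormality, can be performed through q
simpler tests

$$H_0^h: \lambda_h^{(1)} = \lambda_h^{(2)} \quad \text{vs.} \quad H_1^h: \lambda_h^{(1)} \ne \lambda_h^{(2)}, \qquad h = 1, \ldots, q. \tag{9}$$

Since the eigenvalues of the covariance matrix $\boldsymbol{\Sigma}_1$ ($\boldsymbol{\Sigma}_2$) coincide with the variances
along the principal axes, the h-th hypothesis in (9) can be tested by means
of a well-known F-test on equality of variances based on the samples $y_h^{(1)}$ and $y_h^{(2)}$
obtained from $x_h^{(1)}$ and $x_h^{(2)}$ $(h = 1, \ldots, q)$ by means of the principal component
transformation (7).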
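The procedure above can be sketched as follows, assuming NumPy and two data matrices with observations in rows. The sketch applies transformation (7) to each sample and returns, for every principal axis, the variance ratio on which the h-th F-test in (9) is based, together with the degrees of freedom. The function names are illustrative; p-values would then be read off the F distribution with those degrees of freedom, for instance with `scipy.stats.f.cdf`.

```python
import numpy as np

def principal_scores(x):
    """Principal component transformation (7): centre the rows of x and rotate."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean(axis=0)
    s = np.cov(centred, rowvar=False)   # sample covariance matrix S
    _, g = np.linalg.eigh(s)            # eigenvectors, eigenvalues in ascending order
    return centred @ g

def variance_ratio_statistics(x1, x2):
    """Variance ratio for each principal axis h and the two degrees of freedom."""
    y1, y2 = principal_scores(x1), principal_scores(x2)
    var1 = y1.var(axis=0, ddof=1)       # eigenvalues of S_1
    var2 = y2.var(axis=0, ddof=1)       # eigenvalues of S_2
    return var1 / var2, (y1.shape[0] - 1, y2.shape[0] - 1)
```

Because the statistics depend only on the ordered eigenvalues of the two sample covariance matrices, rotating one sample leaves every ratio unchanged.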

Under the weak homoscedasticity assumption, the covariance matrices in (4)
have the same ordered set of eigenvalues, i.e., $0 < \lambda_{\min} = \lambda_1 \le \lambda_2 \le \cdots \le
\lambda_q = \lambda_{\max} < +\infty$. Hence, based on results given in Ingrassia (2004), we have

$$\lambda_{\min}\left(\boldsymbol{\Sigma}_h \boldsymbol{\Sigma}_j^{-1}\right) \ge \frac{\lambda_{\min}(\boldsymbol{\Sigma}_h)}{\lambda_{\max}(\boldsymbol{\Sigma}_j)} = \frac{\lambda_{\min}}{\lambda_{\max}}, \qquad h, j = 1, 2, \ h \ne j$$

and thus the constraint (5) is satisfied once

$$\frac{\lambda_{\min}}{\lambda_{\max}} \ge c > 0. \tag{10}$$

For the sake of clarity, in the rest of the paper the usual homoscedasticity
assumption will be referred to as the strong homoscedasticity condition.



4 Numerical Studies 

In this section we present some numerical studies concerning modeling of the Crab
data set through a mixture of t-distributions. Preliminarily, the weak homoscedasticity
of the two classes has been tested: performing q = 5 tests on variance equality,
the following p-values

0.1912 0.2647 0.2862 0.4505 0.2586

have been obtained. None of the hypotheses in (9) is rejected at the usual significance
levels, so the data are compatible with weak homoscedasticity.
We shall consider two groups of simulations: first we compare the unconstrained
EM algorithm with the constrained versions implementing both weak and strong
homoscedasticity. Afterwards, we consider a problem of robust classification along
the design proposed in Peel and McLachlan (2000) (see also Sect. 7.8 in McLachlan
& Peel, 2000).

The purpose of this first analysis is to cluster the data, ignoring the true classification,
and to compare the performances obtained by fitting a mixture of two
t-distributions in the three cases of no constraints, weak and strong homoscedasticity.

To begin with, a batch of simulations concerned a set of 100 runs of the EM algorithm
based on a mixture of t-distributions without any constraint, for 100 randomly
chosen starting points. The initial mixing weight $\alpha_1^{(0)}$ has been drawn from the uniform
distribution on (0, 1) (and then $\alpha_2^{(0)} = 1 - \alpha_1^{(0)}$), while the initial mean vectors and
covariance matrices have been evaluated on two random subsamples of the given
groups. For each run the misclassification error has been evaluated and the results
are shown in the first row of Table 1.

The EM algorithm failed in 6% of cases, due to singularities of the likelihood
function; in the other cases, a large variability in the misclassification error rate has
been observed: the misclassification error rate lies in the interval [11%, 47%], the
quartiles are $Q_1 = Q_2 = 11\%$, $Q_3 = 17\%$ and the 90th percentile is 36%.



Table 1 Range, three quartiles and 90th percentile of the error rate over 100 runs of the EM with
a mixture of t-distributions: comparison among the unconstrained and constrained algorithms for
both weak and strong homoscedasticity

Mixture model              Error rate (%)                        Failures (%)
                           Range    Q1    Q2    Q3    X90
No constraints             11-47    11    11    17    36         6
Weak homoscedasticity      11-49    11    11    12    12         0
Strong homoscedasticity    16-50    18    50    50    50         0
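Afterward, we ran the EM algorithm, using the same set of 100 starting points
and imposing weak homoscedasticity. At the m-th iteration of the EM algorithm,
such constraints have been implemented according to the following steps:

1. Obtain the spectral decomposition of the covariance matrices $\boldsymbol{\Sigma}_1^{(m)}$ and $\boldsymbol{\Sigma}_2^{(m)}$,
namely $\boldsymbol{\Sigma}_1^{(m)} = \boldsymbol{\Gamma}_1^{(m)} \boldsymbol{\Lambda}_1^{(m)} \boldsymbol{\Gamma}_1^{\prime(m)}$ and $\boldsymbol{\Sigma}_2^{(m)} = \boldsymbol{\Gamma}_2^{(m)} \boldsymbol{\Lambda}_2^{(m)} \boldsymbol{\Gamma}_2^{\prime(m)}$, where $\boldsymbol{\Lambda}_i^{(m)}$ is
the diagonal matrix of the eigenvalues of $\boldsymbol{\Sigma}_i^{(m)}$ and $\boldsymbol{\Gamma}_i^{(m)}$ is the orthogonal matrix
whose columns are the standardized eigenvectors of $\boldsymbol{\Sigma}_i^{(m)}$ $(i = 1, 2)$.

2. Compute the weighted mean of the diagonal matrices of eigenvalues, i.e.,
$\boldsymbol{\Lambda}^{(m)} = \alpha_1^{(m)} \boldsymbol{\Lambda}_1^{(m)} + \alpha_2^{(m)} \boldsymbol{\Lambda}_2^{(m)}$.

3. Set $\boldsymbol{\Sigma}_1^{(m)} = \boldsymbol{\Gamma}_1^{(m)} \boldsymbol{\Lambda}^{(m)} \boldsymbol{\Gamma}_1^{\prime(m)}$ and $\boldsymbol{\Sigma}_2^{(m)} = \boldsymbol{\Gamma}_2^{(m)} \boldsymbol{\Lambda}^{(m)} \boldsymbol{\Gamma}_2^{\prime(m)}$.

The $\nu_i$ $(i = 1, 2)$ are estimated by computing their MLE and solving the obtained
equation by a one-dimensional search (such as Newton's method). Afterward, following
Peel and McLachlan (2000), we imposed that the two components have the
same degrees of freedom, just setting them to their weighted mean. Finally, in (10)
we set c = 0.01.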
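Steps 1 to 3 can be sketched as follows, assuming NumPy, a list of component covariance matrices and the current mixing weights. The function name `impose_weak_homoscedasticity` is hypothetical, and returning a flag for constraint (10) is an assumption of this sketch: the text does not state which action is taken when the ratio falls below c.

```python
import numpy as np

def impose_weak_homoscedasticity(sigmas, alphas, c=0.01):
    """Replace the eigenvalues of each covariance by their weighted mean."""
    # Step 1: spectral decompositions, eigenvalues in ascending order.
    decomps = [np.linalg.eigh(s) for s in sigmas]
    # Step 2: weighted mean of the ordered eigenvalues.
    lam = sum(a * vals for a, (vals, _) in zip(alphas, decomps))
    # Constraint (10): ratio of the smallest to the largest common eigenvalue.
    ok = bool(lam[0] / lam[-1] >= c)
    # Step 3: rebuild each matrix from its own eigenvectors and the common eigenvalues.
    new_sigmas = [vecs @ np.diag(lam) @ vecs.T for _, vecs in decomps]
    return new_sigmas, lam, ok
```

After this step all components share the same ordered eigenvalues while keeping their own orientation, which is exactly the weak homoscedasticity of Definition 1.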

No failure of the algorithm has been observed; moreover, the misclassification
error rate lies in the interval [11%, 49%], the three quartiles are $Q_1 = Q_2 = 11\%$,
$Q_3 = 12\%$ and the 90th percentile is 12%. We remark that the misclassification
error rate of 49% was observed just once, while a value greater than or equal to 36%
appeared only in six cases over 100 runs. Finally we executed the 100 randomly
initialized runs under the strong homoscedasticity assumption, observing a worse
performance in classification. The results have been summarized in Table 1.

Turning now to robust classification, along the scheme in Peel and McLachlan
(2000), we inserted outliers in the original data set by adding various values of a
constant to the second variate of the 25th data point, the constant taking values in
{-15, -10, -5, 0, 5, 10, 15, 20}. Numerical studies concerned normal mixture modeling
and both constrained and unconstrained t-mixture modeling.
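The perturbation scheme can be written down in a few lines, assuming NumPy and a data matrix with observations in rows. The function name and the variable `constants` are illustrative; the indices follow the text (25th data point, second variate), translated to 0-based positions.

```python
import numpy as np

def perturb_point(data, constant, row=24, col=1):
    """Return a copy of data with `constant` added to a single entry.

    The defaults correspond to the 25th data point and the second variate.
    """
    out = np.array(data, dtype=float, copy=True)
    out[row, col] += constant
    return out

# Values of the perturbing constant considered in the study.
constants = [-15, -10, -5, 0, 5, 10, 15, 20]
```

Each perturbed copy of the data would then be passed to the three fitting procedures in turn, leaving the original data matrix untouched.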

The results concerning the overall misclassification have been listed in Table 2, 
where columns 1 and 2 come from Peel and McLachlan (2000). It may be useful to 
read this table beginning from the fourth row, where the initial data set, without any 



Table 2 Comparison of error rates when fitting normal, t mixtures and weak constrained
t-mixtures to the Crab data set with outliers. The best results of the error rate of the constrained
algorithm are given, followed by their percent frequency, in parentheses

Constant   Normal mixture    t mixture         Weak constrained t mixture
           error rate (%)    error rate (%)    Error rate (%)    ν
-15        49                19                13 (87)           5.76
-10        49                19                13 (86)           6.13
-5         21                20                12 (88)           8.57
0          19                18                11 (94)           24.17
5          21                20                13 (94)           13.64
10         50                20                13 (95)           7.76
15         47                20                13 (93)           6.68
20         49                20                13 (86)           6.17
perturbation, is considered (as the constant is null). The weak constrained
t-mixture EM algorithm outperforms the normal mixture and the unconstrained
t-mixture, reaching a best error rate of 11%, attained in 95% of cases. Then, looking
through the following rows of the table, the value of the constant rises progressively
from 5 to 20 and the error rate for the normal mixture strongly increases, until attaining
the maximum value of 50%. It grows much more slowly for t-mixtures (reaching only
20%), while it remains almost unchanged for the weakly constrained t-mixtures,
assuming steadily the value of 13% (with a very high frequency in all cases, ranging
from 86% to 94%). Considering negative values of the perturbing constant, the
obtained error rates are almost the same as those obtained for the positive values.

The last consideration, referring again to Table 2, concerns the degrees of freedom.
In the unperturbed case, the estimated degrees of freedom are ν = 24.71 and this
indicates that we are not too far from normality. In the other cases, the degrees of
freedom decrease and this shows how they afford protection against nonnormality,
downweighting the contribution of the outliers in the estimation of the parameters.



5 Conclusions 

In this paper, we have proposed a new kind of constraint for mixture models based 
on the idea of weakly homoscedastic covariance matrices. This assumption requires 
only that the covariance matrices of the mixture component densities have the same 
set of eigenvalues. From a geometrical point of view, it amounts to imposing only 
that the ellipsoids of equal concentration have the same volume and shape. 

Allowing the orientation of the ellipsoids to change between clusters, we obtain 
a more parsimonious and easily interpreted model, particularly useful in a variety of 
real datasets (see, among others, the many examples in Banfield & Raftery, 1993; 
and Fraley & Raftery, 2002), when the covariance matrices have a common basic 
structure, even if they are not homoscedastic. 

The idea has been presented in the context of mixtures of t-distributions, but
these concepts can be applied in the more general framework of mixture modeling
based on elliptical distributions, see Fang and Anderson (1990). The weak constraint
implemented for the maximum likelihood in the EM algorithm aims at four
purposes: singularity elimination, reduction of the number of spurious maximizers,
parsimonious modeling and adequacy of the constraints to the given data set. In
particular, under the multivariate normal assumption for the group-conditional data,
the hypothesis of weak homoscedasticity can be easily tested by means of F-tests
on equality of variances.

The numerical studies we have presented involve the dataset which suggested
the problem. The results show the performance of this approach in clustering and
data modeling. Further research about weak homoscedasticity is needed, both in its
theoretical and computational aspects. From a practical point of view, we point out
that weak homoscedastic constraints are easy to implement in the EM algorithm
because at each step of the procedure it suffices to work with the weighted mean of
the ordered eigenvalues obtained by the spectral decomposition of the covariance
matrices.
Bayesian Methods for Graph Clustering



Pierre Latouche, Etienne Birmelé, and Christophe Ambroise



Abstract Networks are used in many scientific fields such as biology, social science,
and information technology. They aim at modelling, with edges, the way
objects of interest, represented by vertices, are related to each other. Looking for
clusters of vertices, also called communities or modules, has appeared to be a powerful
approach for capturing the underlying structure of a network. In this context,
the Block-Clustering model has been applied to random graphs. The principle of
this method is to assume that given the latent structure of a graph, the edges are
independent and generated from a parametric distribution. Many EM-like strategies
have been proposed, in a frequentist setting, to optimize the parameters of the model.
Moreover, a criterion, based on an asymptotic approximation of the Integrated
Classification Likelihood (ICL), has recently been derived to estimate the number of
classes in the latent structure. In this paper, we show how the Block-Clustering
model can be described in a full Bayesian framework and how the posterior distribution
of all the parameters and latent variables can be approximated efficiently
by applying Variational Bayes (VB). We also propose a new non-asymptotic Bayesian
model selection criterion. Using simulated data sets, we compare our approach to
other strategies. We show that our criterion can outperform ICL.

Keywords Bayesian model selection • Block-clustering model • Integrated classification
likelihood • Random graphs • Variational Bayes • Variational EM



1 Introduction 



For the last few years, networks have been increasingly studied. Indeed, many
scientific fields such as biology, social science, and information technology see
these mathematical structures as powerful tools to model the interactions between
objects of interest. Examples of data sets having such structures are friendship and
protein-protein interaction networks, power grids, and the Internet. In this context,
a lot of attention has been paid to developing models that learn knowledge from the
network topology. Many methods have been proposed, and in this work, we focus
on statistical models that describe the way edges connect vertices.

A well-known strategy consists in seeing a given network as a realization of a
random graph model based on a mixture distribution (Snijders & Nowicki, 1997;
Daudin, Picard, & Robin, 2008). The method assumes that, according to its connection
profile, each vertex belongs to a hidden class of a latent structure and that, given
this latent structure, all the observed edges are independent and binary distributed.
Many names have been proposed for this model, and in the following, it will be
denoted MixNet, which is equivalent to the Block-Clustering model of Snijders and
Nowicki (1997).

A key question is the estimation of the MixNet parameters. So far, the optimization
procedures that have been proposed are based on heuristics or have been
described in a frequentist setting (Daudin et al., 2008). Bayesian strategies have
also been developed but are limited in the sense that they cannot handle large
networks. All those methods face the same difficulty. Indeed, the posterior distribution
$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\alpha}, \boldsymbol{\pi})$ of all the latent variables $\mathbf{Z}$ given the observed edges $\mathbf{X}$
cannot be factorized. To tackle this problem, Daudin et al. proposed a variational
approximation of the posterior, which corresponds to a mean-field approximation.

Another difficulty is the estimation of the number of classes in the mixture.
Indeed, many criteria, such as the Bayesian Information Criterion (BIC) or the
Akaike Information Criterion (AIC), are based on the likelihood $p(\mathbf{X} \mid \boldsymbol{\alpha}, \boldsymbol{\pi})$ of the
incomplete data set $\mathbf{X}$, which is intractable here. Therefore, Mariadassou and Robin
(2007) derived a criterion based on an asymptotic approximation of the Integrated
Classification Likelihood (also called Integrated Complete-data Likelihood). More
details can be found in Biernacki, Celeux, and Govaert (2000). They found that this
criterion, which we will denote ICL for simplicity, was very accurate in most situations
but tended to underestimate the number of classes when dealing with small graphs.
We emphasize that ICL is currently the only model-based criterion developed for
MixNet.

In this paper, we extend the work of Hofman and Wiggins (2008) who developed
a variational Bayes algorithm to learn affiliation models. These are defined
by only two probabilities of connection $\lambda$ and $\epsilon$. Given a network, it is assumed
that the edges connecting nodes of the same class were generated with probability
$\lambda$ while edges connecting nodes of different classes were drawn with probability $\epsilon$.
The algorithm that they proposed can cluster the nodes and estimate the probabilities
$\lambda$ and $\epsilon$ very quickly. However, affiliation models cannot characterize the complex
topology of most real networks, in which the majority of nodes have no or very few
links while some hubs make them locally dense (Daudin
et al., 2008). Therefore, we propose an efficient Bayesian version of MixNet, which
allows vertices to have different topological behaviors. Thus, after having presented
MixNet in Sect. 2, we introduce some prior distributions and describe the MixNet
Bayesian probabilistic model in Sect. 3. We derive the model optimization equations
using Variational Bayes and we propose a new criterion to estimate the number of
classes. Finally, in Sect. 4, we carry out some experiments using simulated data sets
to compare the number of clusters estimated by the ICL criterion combined with
the variational frequentist strategy with the number estimated by our approach.

An extended version of this paper with proofs of the results and more experiments
is available (Latouche, Birmelé, & Ambroise, 2008).



2 A Mixture Model for Networks 

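We consider an undirected binary random graph G, where V denotes a set of N
fixed vertices and $\mathbf{X} = \{X_{ij}, (i, j) \in V^2\}$ is the set of all the random edges. We
assume that G does not have any self loop. Therefore, the variables $X_{ii}$ will not be
taken into account.

MixNet assumes that each vertex i belongs to an unknown class q among Q
classes and the latent variable $\mathbf{Z}_i$ reflects our uncertainty as to which one that is:

$$\mathbf{Z}_i \sim \mathcal{M}\left(1, \boldsymbol{\alpha} = \{\alpha_1, \alpha_2, \ldots, \alpha_Q\}\right),$$

where we denote $\boldsymbol{\alpha}$ the vector of class proportions. The edge probabilities are then
given by

$$X_{ij} \mid \{Z_{iq} Z_{jl} = 1\} \sim \mathcal{B}(X_{ij} \mid \pi_{ql}).$$

Thus, contrary to affiliation models (Hofman & Wiggins, 2008), we consider a $Q \times Q$
matrix $\boldsymbol{\pi}$ of connection probabilities. Note that in the case of undirected networks,
$\boldsymbol{\pi}$ is symmetric. The latent variables in the set $\mathbf{Z} = (\mathbf{Z}_1, \ldots, \mathbf{Z}_N)$ are iid and given
this latent structure, all the edges are supposed to be independent. Thus, we obtain

$$p(\mathbf{Z} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{N} \mathcal{M}(\mathbf{Z}_i; 1, \boldsymbol{\alpha}) = \prod_{i=1}^{N} \prod_{q=1}^{Q} \alpha_q^{Z_{iq}},$$

and

$$p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol{\pi}) = \prod_{i<j} p(X_{ij} \mid \mathbf{Z}_i, \mathbf{Z}_j, \boldsymbol{\pi}) = \prod_{i<j}^{N} \prod_{q,l}^{Q} \mathcal{B}(X_{ij} \mid \pi_{ql})^{Z_{iq} Z_{jl}}.$$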
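The generative model can be sketched in a few lines, assuming NumPy, class labels stored as integers rather than indicator vectors, and connection probabilities strictly between 0 and 1. The function names `sample_mixnet` and `complete_log_likelihood` are illustrative and are not part of the MixNet software.

```python
import numpy as np

def sample_mixnet(n, alpha, pi, rng):
    """Draw class labels and a symmetric 0/1 adjacency matrix without self loops."""
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    z = rng.choice(len(alpha), size=n, p=alpha)       # Z_i ~ M(1, alpha), as labels
    probs = pi[z[:, None], z[None, :]]                # pi_ql for every pair (i, j)
    draws = (rng.random((n, n)) < probs).astype(int)  # Bernoulli draws
    upper = np.triu(draws, k=1)                       # keep one draw per pair i < j
    return z, upper + upper.T

def complete_log_likelihood(x, z, alpha, pi):
    """ln p(Z | alpha) + ln p(X | Z, pi), with the edge product taken over i < j."""
    x, z = np.asarray(x), np.asarray(z)
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    i, j = np.triu_indices(len(z), k=1)
    p = pi[z[i], z[j]]
    edges = x[i, j]
    log_edges = np.sum(edges * np.log(p) + (1 - edges) * np.log(1 - p))
    return float(np.sum(np.log(alpha[z])) + log_edges)
```

With two vertices in different classes joined by an edge, the second function returns $\ln \alpha_1 + \ln \alpha_2 + \ln \pi_{12}$, which matches the two products above term by term.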



3 Bayesian View of MixNet 
3.1 Bayesian Probabilistic Model 

We now show how MixNet can be described in a full Bayesian framework. To transform
the MixNet frequentist probabilistic model, we first specify some prior distributions
for the model parameters. To simplify the calculations, we use conjugate
priors. Thus, since $p(\mathbf{Z}_i \mid \boldsymbol{\alpha})$ is a multinomial distribution, we choose a Dirichlet
distribution for the mixing coefficients:

$$p\left(\boldsymbol{\alpha} \mid \mathbf{n}^0 = \{n_1^0, \ldots, n_Q^0\}\right) = \mathrm{Dir}(\boldsymbol{\alpha}; \mathbf{n}^0) = \frac{\Gamma\left(\sum_{q=1}^{Q} n_q^0\right)}{\Gamma(n_1^0) \cdots \Gamma(n_Q^0)} \prod_{q=1}^{Q} \alpha_q^{n_q^0 - 1},$$

where we denote $n_q^0$ the prior number of vertices in the q-th component of the
mixture. In order to obtain a posterior distribution influenced primarily by the network
data rather than the prior, small values have to be chosen. A typical choice is
$n_q^0 = 1/2, \forall q$. This leads to a non-informative Jeffreys prior distribution. It is also
possible to consider a uniform distribution on the $Q - 1$ dimensional simplex by
fixing $n_q^0 = 1, \forall q$.

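Since $p(X_{ij} \mid \mathbf{Z}_i, \mathbf{Z}_j, \boldsymbol{\pi})$ is a Bernoulli distribution, we use Beta priors to model
the connectivity matrix $\boldsymbol{\pi}$:

$$p\left(\boldsymbol{\pi} \mid \boldsymbol{\eta}^0 = (\eta_{ql}^0), \boldsymbol{\zeta}^0 = (\zeta_{ql}^0)\right) = \prod_{q \le l}^{Q} \mathrm{Beta}(\pi_{ql}; \eta_{ql}^0, \zeta_{ql}^0) = \prod_{q \le l}^{Q} \frac{\Gamma(\eta_{ql}^0 + \zeta_{ql}^0)}{\Gamma(\eta_{ql}^0)\Gamma(\zeta_{ql}^0)} \pi_{ql}^{\eta_{ql}^0 - 1} (1 - \pi_{ql})^{\zeta_{ql}^0 - 1}, \tag{1}$$

where $\eta_{ql}^0$ and $\zeta_{ql}^0$ represent respectively the prior number of edges and non-edges
connecting vertices of cluster q to vertices of cluster l. A common choice consists
in setting $\eta_{ql}^0 = \zeta_{ql}^0 = 1, \forall (q, l)$. This gives rise to a uniform prior distribution. Since
$\boldsymbol{\pi}$ is symmetric, only the terms of the upper or lower triangular matrix have to be
considered. This explains the product over $q \le l$.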

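Thus, the model parameters are now seen as random variables. They depend on
parameters $\mathbf{n}^0$, $\boldsymbol{\eta}^0$, and $\boldsymbol{\zeta}^0$ which are called hyperparameters in the Bayesian literature
(MacKay, 1992). The joint distribution of the Bayesian probabilistic model is
then given by

$$p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi} \mid \mathbf{n}^0, \boldsymbol{\eta}^0, \boldsymbol{\zeta}^0) = p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol{\pi})\, p(\mathbf{Z} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha} \mid \mathbf{n}^0)\, p(\boldsymbol{\pi} \mid \boldsymbol{\eta}^0, \boldsymbol{\zeta}^0).$$

For the rest of the paper, since the prior hyperparameters are fixed and in order
to keep the notations simple, they will not be shown explicitly in the conditional
distributions.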



3.2 Variational Inference 

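The inference task consists in evaluating the posterior $p(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi} \mid \mathbf{X})$ of all the hidden
variables (latent variables $\mathbf{Z}$ and parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\pi}$) given the observed
edges $\mathbf{X}$. Unfortunately, under MixNet, this distribution is intractable. To overcome
such difficulties, we follow the work of Attias (1999) and Corduneanu and Bishop
(2001) on Bayesian mixture modelling and Bayesian model selection. Thus, we first
introduce a factorized distribution:

$$q(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi}) = q(\boldsymbol{\alpha})\, q(\boldsymbol{\pi})\, q(\mathbf{Z}) = q(\boldsymbol{\alpha})\, q(\boldsymbol{\pi}) \prod_{i=1}^{N} q(\mathbf{Z}_i),$$

and we use Variational Bayes to obtain an optimal approximation $q(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi})$ of
the posterior. This framework is called the mean field theory in physics (Parisi,
1988). The Kullback-Leibler divergence enables us to decompose the log-marginal
probability, usually called the model evidence or the log Integrated Observed-data
Likelihood, and we obtain

$$\ln p(\mathbf{X}) = \mathcal{L}(q(\cdot)) + \mathrm{KL}\left(q(\cdot) \,\|\, p(\cdot \mid \mathbf{X})\right), \tag{2}$$

where

$$\mathcal{L}(q(\cdot)) = \sum_{\mathbf{Z}} \int\!\!\int q(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi}) \ln\left\{\frac{p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi})}{q(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi})}\right\} d\boldsymbol{\alpha}\, d\boldsymbol{\pi}, \tag{3}$$

and

$$\mathrm{KL}\left(q(\cdot) \,\|\, p(\cdot \mid \mathbf{X})\right) = -\sum_{\mathbf{Z}} \int\!\!\int q(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi}) \ln\left\{\frac{p(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi} \mid \mathbf{X})}{q(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi})}\right\} d\boldsymbol{\alpha}\, d\boldsymbol{\pi}. \tag{4}$$

Minimizing (4) is equivalent to maximizing the lower bound (3) of (2). However,
we now have a full variational optimization problem since the model parameters are
random variables and we are looking for the best approximation $q(\mathbf{Z}, \boldsymbol{\alpha}, \boldsymbol{\pi})$ among
all the factorized distributions. In the following, we use a variational Bayes EM
algorithm. We call Variational Bayes E-step the optimization of each distribution
$q(\mathbf{Z}_i)$, and Variational Bayes M-step the approximations of the remaining factors.
We derive the update equations only in the case of an undirected graph G without
self-loop. Our algorithm cycles through the E and M steps until convergence of the
lower bound (11).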



3.2.1 Variational Bayes E-Step 

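The optimal approximation at vertex i is

$$q(\mathbf{Z}_i) = \mathcal{M}\left(\mathbf{Z}_i; 1, \boldsymbol{\tau}_i = \{\tau_{i1}, \ldots, \tau_{iQ}\}\right), \tag{5}$$

where $\tau_{iq}$ is the probability (responsibility) of node i to belong to class q. It satisfies
the relation:

$$\tau_{iq} \propto e^{\psi(n_q) - \psi\left(\sum_{l=1}^{Q} n_l\right)} \prod_{j \ne i}^{N} \prod_{l=1}^{Q} e^{\tau_{jl}\left(\psi(\zeta_{ql}) - \psi(\eta_{ql} + \zeta_{ql}) + X_{ij}\left(\psi(\eta_{ql}) - \psi(\zeta_{ql})\right)\right)}, \tag{6}$$

where $\psi(\cdot)$ is the digamma function. Given a matrix $\boldsymbol{\tau}^{old}$, the algorithm builds a new
matrix $\boldsymbol{\tau}^{new}$ where each row satisfies (6). It then uses $\boldsymbol{\tau}^{new}$ to build a new matrix and
so on. It stops when the difference between $\boldsymbol{\tau}^{old}$ and $\boldsymbol{\tau}^{new}$ falls below a threshold e. A rather small value for e has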

to be chosen. In the experiments that we carried out, we chose e = 10“’^^. 
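The fixed-point iteration on relation (6) can be sketched as follows, assuming NumPy, a symmetric 0/1 adjacency matrix with zero diagonal, and the hyperparameters $\eta_{ql}$ and $\zeta_{ql}$ stored as full symmetric $Q \times Q$ arrays. The digamma function is approximated here by a recurrence and an asymptotic series so that the example is self-contained; `scipy.special.digamma` could be used instead. The stopping rule (sum of absolute differences below `tol`) and the default values of `tol` and `max_iter` are assumptions of this sketch.

```python
import math
import numpy as np

def digamma(x):
    """Digamma function for x > 0, via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def e_step(x, tau, n, eta, zeta, tol=1e-8, max_iter=100):
    """Iterate relation (6) on the N x Q responsibilities until they stabilise."""
    dg = np.vectorize(digamma)
    log_alpha = dg(n) - digamma(float(np.sum(n)))   # E[ln alpha_q]
    log_pi = dg(eta) - dg(eta + zeta)               # E[ln pi_ql]
    log_1mpi = dg(zeta) - dg(eta + zeta)            # E[ln(1 - pi_ql)]
    off_diag = 1.0 - np.eye(x.shape[0])             # drops the terms j = i
    for _ in range(max_iter):
        log_tau = (log_alpha
                   + (x * off_diag) @ tau @ log_pi.T
                   + ((1 - x) * off_diag) @ tau @ log_1mpi.T)
        log_tau -= log_tau.max(axis=1, keepdims=True)  # numerical stability
        new_tau = np.exp(log_tau)
        new_tau /= new_tau.sum(axis=1, keepdims=True)  # each row sums to one
        converged = np.abs(new_tau - tau).sum() < tol
        tau = new_tau
        if converged:
            break
    return tau
```

Working with logarithms and normalizing each row afterwards avoids underflow in the double product of (6) when N is large.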



3.2.2 Variational Bayes M-Step: Optimization of q(α)

The optimization of the lower bound with respect to $q(\boldsymbol{\alpha})$ produces a distribution
with the same functional form as the prior $p(\boldsymbol{\alpha})$:

$$q(\boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\alpha}; \mathbf{n}), \tag{7}$$

where $n_q = n_q^0 + \sum_{i=1}^{N} \tau_{iq}$ is the pseudo number of vertices in the q-th component
of the mixture.



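3.2.3 Variational Bayes M-Step: Optimization of q(π)

Again, the functional form of the prior $p(\boldsymbol{\pi})$ is conserved through the variational
optimization:

$$q(\boldsymbol{\pi}) = \prod_{q \le l}^{Q} \mathrm{Beta}(\pi_{ql}; \eta_{ql}, \zeta_{ql}), \tag{8}$$

where $\eta_{ql}$ and $\zeta_{ql}$ represent respectively the pseudo number of edges and non-edges
connecting vertices of cluster q to vertices of cluster l. For $q \ne l$, the
hyperparameter $\eta_{ql}$ is given by

$$\eta_{ql} = \eta_{ql}^0 + \sum_{i \ne j}^{N} X_{ij} \tau_{iq} \tau_{jl}, \quad \text{and } \forall q: \ \eta_{qq} = \eta_{qq}^0 + \sum_{i < j}^{N} X_{ij} \tau_{iq} \tau_{jq}. \tag{9}$$

Moreover, for $q \ne l$, the hyperparameter $\zeta_{ql}$ is given by

$$\zeta_{ql} = \zeta_{ql}^0 + \sum_{i \ne j}^{N} (1 - X_{ij}) \tau_{iq} \tau_{jl}, \quad \text{and } \forall q: \ \zeta_{qq} = \zeta_{qq}^0 + \sum_{i < j}^{N} (1 - X_{ij}) \tau_{iq} \tau_{jq}. \tag{10}$$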
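Updates (7), (9) and (10) translate directly into matrix products. The sketch below assumes NumPy, a symmetric 0/1 adjacency matrix, responsibilities stored as an $N \times Q$ array and hyperparameters stored as full symmetric $Q \times Q$ arrays; the function name `m_step` is an illustrative choice.

```python
import numpy as np

def m_step(x, tau, n0, eta0, zeta0):
    """Updates (7), (9) and (10) for an undirected graph without self loops."""
    x = np.asarray(x, dtype=float)
    off_diag = 1.0 - np.eye(x.shape[0])
    n = n0 + tau.sum(axis=0)                          # n_q = n_q^0 + sum_i tau_iq
    edges = tau.T @ (x * off_diag) @ tau              # sum over i != j of X_ij tau_iq tau_jl
    non_edges = tau.T @ ((1.0 - x) * off_diag) @ tau  # same with 1 - X_ij
    # For q = l the sum runs over i < j, i.e. half of the sum over i != j.
    half_diag = 1.0 - 0.5 * np.eye(tau.shape[1])
    eta = eta0 + edges * half_diag
    zeta = zeta0 + non_edges * half_diag
    return n, eta, zeta
```

With hard assignments the updated $\eta_{ql}$ and $\zeta_{ql}$ are simply the prior values plus the counts of edges and non-edges between the two clusters, which matches their interpretation as pseudo numbers.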



3.2.4 Lower Bound 

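The lower bound takes a simple form after the Variational Bayes M-step. Indeed, it
only depends on the posterior probabilities $\tau_{iq}$ as well as the normalizing constants
of the Dirichlet and Beta distributions:

$$\mathcal{L}(q(\cdot)) = \ln\left\{\frac{\Gamma\left(\sum_{q=1}^{Q} n_q^0\right) \prod_{q=1}^{Q} \Gamma(n_q)}{\Gamma\left(\sum_{q=1}^{Q} n_q\right) \prod_{q=1}^{Q} \Gamma(n_q^0)}\right\} + \sum_{q \le l}^{Q} \ln\left\{\frac{\Gamma(\eta_{ql}^0 + \zeta_{ql}^0)\,\Gamma(\eta_{ql})\,\Gamma(\zeta_{ql})}{\Gamma(\eta_{ql} + \zeta_{ql})\,\Gamma(\eta_{ql}^0)\,\Gamma(\zeta_{ql}^0)}\right\} - \sum_{i=1}^{N} \sum_{q=1}^{Q} \tau_{iq} \ln \tau_{iq}. \tag{11}$$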
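
As a sketch of how (11) might be evaluated numerically, the following function works on the log scale with `math.lgamma` to avoid overflow of the Gamma function. The function name `lower_bound` and the argument layout are assumptions made for illustration; the bound is assumed to be evaluated right after an M-step, as stated above.

```python
import math

def lower_bound(tau, n0, n, eta0, zeta0, eta, zeta):
    """Evaluate the lower bound (11) after a Variational Bayes M-step."""
    Q = len(n)
    # Ratio of Dirichlet normalizing constants (prior over posterior).
    dirichlet = (math.lgamma(sum(n0)) - sum(math.lgamma(v) for v in n0)
                 - math.lgamma(sum(n)) + sum(math.lgamma(v) for v in n))
    # Ratios of Beta normalizing constants, one per pair q <= l.
    beta = 0.0
    for q in range(Q):
        for l in range(q, Q):
            beta += (math.lgamma(eta0[q][l] + zeta0[q][l])
                     - math.lgamma(eta0[q][l]) - math.lgamma(zeta0[q][l])
                     - math.lgamma(eta[q][l] + zeta[q][l])
                     + math.lgamma(eta[q][l]) + math.lgamma(zeta[q][l]))
    # Entropy of the factorized distribution over the class indicators.
    entropy = -sum(t * math.log(t) for row in tau for t in row if t > 0.0)
    return dirichlet + beta + entropy
```

A useful sanity check is the case Q = 1 with two vertices joined by one edge and uniform priors: the bound then equals ln(1/2), the exact log-probability of a single edge under a uniform Beta prior.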



3.3 Model Selection 

We have not yet addressed the problem of estimating the number of classes in the
mixture. Given a set of values of Q, our goal is to select the value Q* which maximizes
the log-probability of the observed edges $\ln p(X \mid Q)$. Unfortunately, this quantity
does not have any analytical expression. Indeed, for each value of Q, it involves
integrating over all the hidden parameters as shown in Sect. 3.2. Nevertheless, it can
be approximated using our Variational Bayes algorithm. Given a value of Q, the
algorithm is used to maximize the lower bound (11). Meanwhile, the Kullback-Leibler
divergence between the factorized and the unknown posterior distribution
decreases. After convergence, although this divergence cannot be computed
analytically, we expect it to be close to zero, and therefore we can use the lower bound
as an approximation of $\ln p(X \mid Q)$. This procedure is repeated for the different
values of Q considered.

Given a value of Q, since MixNet is a mixture model, for any given setting of the
parameters $\alpha$ and $\pi$ there will be a total of Q! equivalent parameter settings which
lead to the same distribution over the edges. These settings differ only through
re-labelling of the components. In a frequentist framework, this redundancy is
irrelevant since we only look for point estimates of the model parameters. In a Bayesian
setting, however, we integrate over all possible parameter values. Since $p(X \mid Q)$ is
multimodal, variational techniques will tend to approximate the distribution in the
neighborhood of one of the modes and ignore the others (see Bishop, 2006). Thus, when
comparing different values of Q, we need to take this multimodality into account. As a
consequence, we define a criterion by subtracting a term $\ln Q!$ from the lower bound (11)
computed previously.
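
The resulting selection rule is simple to express in code. The sketch below assumes that a converged lower bound is already available for each candidate Q; the function names and the numerical values in the usage lines are hypothetical and serve only to illustrate the correction term.

```python
import math

def vb_criterion(lower_bound_value, Q):
    """Lower bound (11) corrected by ln Q!, computed as lgamma(Q + 1)."""
    return lower_bound_value - math.lgamma(Q + 1)

def select_number_of_classes(bounds):
    """bounds maps each candidate Q to its converged lower bound.

    Returns the Q with the largest criterion and all criterion values.
    """
    scores = {Q: vb_criterion(lb, Q) for Q, lb in bounds.items()}
    best_Q = max(scores, key=scores.get)
    return best_Q, scores

# Hypothetical lower bounds, for illustration only.
example_bounds = {1: -100.0, 2: -90.0, 3: -89.5}
best_Q, scores = select_number_of_classes(example_bounds)
```

In this made-up example the raw bound is largest for Q = 3, but the correction ln 3! outweighs the small gain over Q = 2, so the criterion selects two classes.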

In the case of networks, we emphasize that our work led to the first criterion
based on a non-asymptotic approximation of the model evidence, also called the
Integrated Observed-data Likelihood. For other types of mixture models,
Biernacki et al. (2000) showed that such criteria are very powerful for selecting the
number of classes.

4 Experiments 

We present some results of the experiments we carried out to assess our Bayesian
version of MixNet and the model selection criterion we proposed in Sect. 3.3.
Throughout our experiments, we compared our approach to the work of Daudin
et al. (2008), who used ICL as a criterion to identify the number of classes in
latent structures and the frequentist approach of variational EM to estimate the
model parameters. We considered synthetic data, generated according to known
random graph models, and we concentrated on analyzing the capacity of ICL and our
criterion to retrieve the true number of classes in the latent structures.



4.1 Comparison of the Criteria 

In these experiments, we consider simple affiliation models where only two types of
edges exist: edges between nodes of the same class and edges between nodes of
different classes. Each type of edge has a given probability, respectively $\pi_{qq} = \lambda$ and
$\pi_{ql} = \epsilon$. Following Mariadassou and Robin (2007), who showed that ICL tends
to underestimate the number of classes in the case of small graphs, we generated
graphs with only N = 50 vertices to analyze the robustness of our criterion.
Moreover, to limit the number of free parameters, we studied the case where $\lambda = 1 - \epsilon$.
Thus, we considered the three affiliation models shown in Table 1.

For each affiliation model, we analyzed graphs with $Q_{True} \in \{2, \dots, 5\}$ classes
mixed in the same proportions $\alpha_1 = \dots = \alpha_{Q_{True}} = 1/Q_{True}$. Thus, we studied a total
of 12 graph models.
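
To make the generative model concrete, an affiliation graph of this kind could be simulated as follows. This is a minimal sketch, not the simulation code used for the experiments: the function name is hypothetical, and class labels are assumed to be drawn independently with equal probabilities, which yields equal proportions only on average.

```python
import random

def simulate_affiliation_graph(N, Q, lam, eps, seed=None):
    """Draw an undirected affiliation graph without self-loops.

    Vertices of the same class are connected with probability lam,
    vertices of different classes with probability eps.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(Q) for _ in range(N)]   # equal class probabilities 1/Q
    X = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            p = lam if labels[i] == labels[j] else eps
            X[i][j] = X[j][i] = 1 if rng.random() < p else 0
    return X, labels

# One graph from the most structured model (lam = 0.9, eps = 0.1) with 3 classes.
X, labels = simulate_affiliation_graph(N=50, Q=3, lam=0.9, eps=0.1, seed=0)
```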

For each of these graph models, we simulated 100 networks. In order to
estimate the number of classes in the latent structures, we applied our algorithm and
the variational EM approach of Daudin et al. (2008) to each network, for various
numbers of classes $Q \in \{1, \dots, 6\}$. Note that we chose $n_q^0 = 1, \forall q \in \{1, \dots, Q\}$
for the Dirichlet prior and $\eta_{ql}^0 = \zeta_{ql}^0 = 1, \forall (q, l) \in \{1, \dots, Q\}^2$ for the Beta
priors. We recall that such distributions correspond to uniform distributions. Like any
optimization technique, the Bayesian and frequentist methods depend on the
initialization. Thus, for each simulated network and each number of classes Q, we started
the algorithms with five different initializations of $\tau$ obtained using a spectral
clustering method (Ng, Jordan, & Weiss, 2001). Then, for the Bayesian algorithm, we
used the criterion we proposed in Sect. 3.3 to select the best learnt model, whereas
we used ICL in the frequentist approach. Finally, for each simulated network, we
obtained two estimates $\hat{Q}_{ICL}$ and $\hat{Q}_{VB}$ of the number $Q_{True}$ of latent classes by
selecting the $Q \in \{1, \dots, 6\}$ for which the corresponding criterion was maximized.

In Table 2, we observe that for the most structured affiliation model, the two
criteria always estimate the true number of classes correctly, except when $Q_{True} = 5$.
In this case, the Bayesian criterion performs better: it has an accuracy of 95%
against 87% for ICL.



Table 1 Parameters of the three affiliation models considered

Model    $\lambda$    $\epsilon$
1        0.9          0.1
2        0.85         0.15
3        0.8          0.2

Table 2 Confusion matrices for ICL and Bayesian (based on Variational Bayes) criteria.
$\lambda = 0.9$, $\epsilon = 0.1$ and $Q_{True} \in \{2, \dots, 5\}$

(a) $Q_{True} \backslash \hat{Q}_{ICL}$

         1     2     3     4     5     6
2        0   100     0     0     0     0
3        0     0   100     0     0     0
4        0     0     0   100     0     0
5        0     0     0    13    87     0

(b) $Q_{True} \backslash \hat{Q}_{VB}$

         1     2     3     4     5     6
2        0   100     0     0     0     0
3        0     0   100     0     0     0
4        0     0     0   100     0     0
5        0     0     0     4    95     1



Table 3 Confusion matrices for ICL and Bayesian (based on Variational Bayes) criteria.
$\lambda = 0.85$, $\epsilon = 0.15$ and $Q_{True} \in \{2, \dots, 5\}$

(a) $Q_{True} \backslash \hat{Q}_{ICL}$

         1     2     3     4     5     6
2        0   100     0     0     0     0
3        0     0   100     0     0     0
4        0     0     1    98     1     0
5        0     0    10    61    29     0

(b) $Q_{True} \backslash \hat{Q}_{VB}$

         1     2     3     4     5     6
2        0   100     0     0     0     0
3        0     0   100     0     0     0
4        0     0     0    98     2     0
5        0     0     1    29    65     5



Table 4 Confusion matrices for ICL and Bayesian (based on Variational Bayes) criteria.
$\lambda = 0.8$, $\epsilon = 0.2$ and $Q_{True} \in \{2, \dots, 5\}$

(a) $Q_{True} \backslash \hat{Q}_{ICL}$

         1     2     3     4     5     6
2        0   100     0     0     0     0
3        0     0   100     0     0     0
4        0     0    14    86     0     0
5        0    17    36    44     3     0

(b) $Q_{True} \backslash \hat{Q}_{VB}$

         1     2     3     4     5     6
2        0   100     0     0     0     0
3        0     0   100     0     0     0
4        0     0     5    94     1     0
5        0     4    18    43    29     6



These differences increase when considering less structured networks. For
instance, in Tables 3 and 4, when $Q_{True} = 5$, the accuracy of ICL falls sharply
(to 29% and 3%, respectively) whereas the Bayesian criterion remains more stable
(65% and 29%, respectively). Thus, as the modular structure becomes weaker, both
criteria tend to underestimate the number of classes, although the Bayesian
criterion appears to be much more stable.

In all the tables presented above, we did not specify what happens when
$Q_{True} = 1$. Indeed, both techniques always have 100% accuracy in this case. Nor did
we report what happens when $Q_{True} = 6$. In general, our results were very similar
to those obtained for $Q_{True} = 5$. We also used the Adjusted
Rand Index (Hubert & Arabie, 1985) to evaluate the agreement between the true
and estimated partitions. The computation of this index is based on a ratio between
the number of node pairs belonging to the same and to different classes when
considering the true partition and the estimated partition. Two identical partitions have
an Adjusted Rand Index equal to 1. In the experiments we carried out, when the
variational EM method and our algorithm were run on networks with the true number
of latent classes, we obtained almost indistinguishable Adjusted Rand Indices.
Moreover, we point out that we obtained almost the same results in this set of
experiments by choosing uniform distributions ($n_q^0 = 1, \forall q \in \{1, \dots, Q\}$) or Jeffreys
distributions ($n_q^0 = 1/2, \forall q \in \{1, \dots, Q\}$) for the prior over the mixing
coefficients. Finally, we compared the computational costs of the frequentist approach of
variational EM and our Variational Bayes algorithm. Both are equal to $O(Q^2 N^2)$.
Analyzing a sparse network with 200 nodes takes about a minute, and about an hour
for dense networks.
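
For readers who wish to reproduce the partition comparison, the Adjusted Rand Index can be computed from pair counts as sketched below. This is a generic textbook-style implementation, not code from the experiments; the function name and the convention of returning 1 for two trivially identical partitions are choices made here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions given as label lists."""
    n = len(labels_a)
    # Pairs of items placed together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition taken separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # value expected under chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:               # degenerate case, e.g. one class only
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The index is invariant to re-labelling of the classes, which matters here because the components of a mixture model are only identified up to a permutation.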



5 Conclusion 

In this paper, we showed how the MixNet model, also called the Block-Clustering
model, can be described in a fully Bayesian framework. We introduced priors
over the model parameters and developed a procedure, based on Variational
Bayes, to approximate the posterior distribution of all the hidden variables given the
observed edges. In this framework, we derived a new non-asymptotic Bayesian
criterion to select the number of classes in latent structures. We found that our criterion
was more relevant than the criterion denoted ICL in this paper, which is based
on an asymptotic approximation of the Integrated Classification Likelihood. Indeed,
for small networks and complex modular structures, we found that the
accuracy of our criterion was at least as high as that of ICL in every setting
reported. Overall, our Bayesian approach seems very promising for the investigation
of rather small networks and/or networks based on complex structures.



Determining the Number of Components
in Mixture Models for Hierarchical Data

Abstract Recently, various types of mixture models have been developed for data
sets having a hierarchical or multilevel structure (see, e.g., Vermunt, Sociological
Methodology 33:213-239, 2003; Computational Statistics and Data Analysis
51:5368-5376, 2007). Most of these models include finite mixture distributions at
multiple levels of a hierarchical structure. In these multilevel mixture models,
selection of the number of mixture components is more complex than in standard mixture
models because one has to determine the number of mixture components at multiple
levels.

In this study the performance of various model selection methods was
investigated in the context of multilevel mixture models. We focus on determining the
number of mixture components at the higher level. We consider the information
criteria BIC, AIC, AIC3, and CAIC, as well as ICOMP and the validation
log-likelihood. A specific difficulty in applying BIC and CAIC to multilevel models
is that they contain the sample size as one of their terms, and it is not clear
which sample size should be used in their formulae. This
could be the number of groups, the number of individuals, or either the number of
groups or the number of individuals depending on whether one wishes to determine the
number of components at the higher or at the lower level.

Our simulation study showed that when one wishes to determine the number
of mixture components at the higher level, the most appropriate sample size for
BIC and CAIC is the number of groups (higher-level units). Moreover, we found
that BIC, CAIC and ICOMP detect the true number of mixture components very well
when both the components’ separation and the group-level sample size are
large enough. AIC performs best with low separation levels and small sample sizes
at the group level.

Keywords AIC • BIC • ICOMP • Mixture components • Multilevel latent class
analysis • Multilevel mixture model • Validation log-likelihood.



1 Introduction

Vermunt (2003, 2005, 2007, 2008) proposed several types of latent class (LC) and 
mixture models for multilevel data sets with applications in sociological, behavioral, 
and medical research. Examples of two-level data sets include data from individuals 
(lower-level units) nested within families (higher-level units), pupils nested within 
schools, patients nested within primary care centers, and repeated measurements 
nested within individuals. A multilevel latent class model can be applied when in 
addition multiple responses are recorded for the lower-level units, and is thus, in 
fact, a model for three-level data sets. The multilevel LC models dealt with in this 
paper assume that lower-level units (say individuals) belong to LCs at the lower 
level and that higher-level units (say groups) belong to LCs at the higher level. In 
other words, the models contain mixture distributions at two levels. 

There is a wide variety of literature available on the performance of model
selection statistics for determining the number of mixture components in mixture models.
The Bayesian (also known as Schwarz’s) information criterion (BIC) is the most
popular measure for determining the number of mixture components and it is
generally considered to be a good measure (Hagenaars & McCutcheon, 2002; Nylund,
Muthén, & Asparouhov, 2003). Other authors, however, prefer the Akaike
information criterion (AIC) (Leroux, 1992). While deciding on the number of mixture
components is already a complicated task in standard mixture models, it is even
more complex for multilevel mixture models. One of the difficulties consists in
choosing the appropriate sample size in the BIC and CAIC formulae:

$$\mathrm{BIC} = -2 \ln L + k \ln(n) \tag{1}$$

and

$$\mathrm{CAIC} = -2 \ln L + k\,(1 + \ln(n)). \tag{2}$$

Here, L is the maximized value of the likelihood function for the estimated model, 
k is the number of free parameters to be estimated, and n is the number of obser- 
vations, or equivalently, the sample size. There are several options for defining the 
sample size in the multilevel context, including the number of groups, the number of 
individuals, or either the number of groups or number of individuals depending on 
whether one wishes to determine the number of components at the higher or at the 
lower level. Neither the literature on mixture models nor the literature on multilevel 
analysis give hints on what sample size to use in the computation of BIC and CAIC 
in multilevel mixture models. 
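
The effect of the sample size definition is easy to see numerically. The sketch below evaluates (1) and (2) with either the number of groups or the total number of individuals as n. The function name, the dictionary keys, and the numbers in the usage lines are illustrative assumptions; the AIC and AIC3 entries use the standard penalties 2k and 3k, which are not spelled out in the text above.

```python
import math

def information_criteria(log_lik, k, n_groups, n_individuals):
    """Information criteria for a fitted model (smaller is better).

    log_lik is the maximized log-likelihood and k the number of free
    parameters; n_groups and n_individuals are the two candidate
    sample sizes for BIC and CAIC in a two-level data set.
    """
    deviance = -2.0 * log_lik
    return {
        "AIC": deviance + 2 * k,
        "AIC3": deviance + 3 * k,
        "BIC(K)": deviance + k * math.log(n_groups),
        "BIC(K n_k)": deviance + k * math.log(n_individuals),
        "CAIC(K)": deviance + k * (1 + math.log(n_groups)),
        "CAIC(K n_k)": deviance + k * (1 + math.log(n_individuals)),
    }

# Hypothetical fit: 50 groups of 10 individuals, 10 free parameters.
ic = information_criteria(log_lik=-1000.0, k=10, n_groups=50, n_individuals=500)
```

Because the penalty grows with ln(n), using the total number of individuals always penalizes additional parameters more heavily than using the number of groups.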

This article presents the results of a simulation study in which we compared the
performance of several methods for determining the number of mixture components
in multilevel LC models. We investigated the performance of BIC and CAIC
using different sample size definitions, and compared BIC and CAIC with other
model selection measures, such as AIC, AIC3, ICOMP (Bozdogan, 1993), and the
validation log-likelihood (Smyth, 2000). Our focus is on deciding on the number
of mixture components at the higher level.

The next section describes the multilevel LC model. The design of the simulation
study is explained in Sect. 3. The results obtained are presented in Sect. 4. The main
conclusions are highlighted in the last section.



2 Multilevel Latent Class Model 

Let $y_j = (y_{j1}, \dots, y_{ji}, \dots, y_{jI})$ denote the vector with the I responses of
individual j (j = 1, ..., n). A discrete LC variable is denoted by $x_j$, a particular LC by
$\ell_2$, and the number of classes by $L_2$ ($\ell_2 = 1, \dots, L_2$). The basic assumptions of the
LC model are (1) that each individual belongs to (no more than) one latent class,
(2) that the responses of individuals belonging to the same LC are generated by the
same (probability) density, and (3) that the I responses of individual j are
conditionally independent of one another given his/her class membership. Under these
assumptions, the traditional LC model is defined by the following formula:

$$f(y_j) = \sum_{\ell_2 = 1}^{L_2} P(x_j = \ell_2) \prod_{i=1}^{I} f(y_{ji} \mid x_j = \ell_2), \tag{3}$$

where $f(y_j)$ is the marginal density of the responses of individual j, $P(x_j = \ell_2)$
is the unconditional probability of belonging to LC $\ell_2$, and $f(y_{ji} \mid x_j = \ell_2)$ is the
conditional density for response variable i given that one belongs to LC $\ell_2$.

A multilevel LC model differs from a standard LC model in that the parameters 
of interest are allowed to differ randomly across groups (across higher-level units). It 
should be noted that the multilevel LC model is actually a model for three-level data
sets; that is, for multiple responses (level-1 units) nested within individuals (level-2
units), with individuals in turn nested within groups (level-3 units). The
random variation of LC parameters across groups can be modelled using contin- 
uous or discrete group-level latent variables, or by a combination of these two. It 
should be noted that using the discrete latent variable approach, where parameters 
are allowed to differ across latent classes of groups, is similar to using a nonpara- 
metric random effects approach (Aitkin, 1999; Vermunt, 2004). In this article we 
focus on this discrete approach which makes use of group-level latent classes. 

Let $y_{kj} = (y_{kj1}, \dots, y_{kji}, \dots, y_{kjI})$ denote the I responses of individual j
($j = 1, \dots, n_k$) from group k ($k = 1, \dots, K$), and $y_k = (y_{k1}, \dots, y_{kj}, \dots, y_{kn_k})$
the full response vector of group k. The class membership of individual j from
group k is now denoted by $x_{kj}$. In the discrete random-effects approach it is assumed
that every group belongs to one of the $L_3$ group-level LCs or mixture components.
Let $w_k$ denote the class membership of group k and $\ell_3$ denote a particular
group-level LC ($\ell_3 = 1, \dots, L_3$). The multilevel LC model can then be described by the
following two equations:

$$f(y_k) = \sum_{\ell_3 = 1}^{L_3} P(w_k = \ell_3) \prod_{j=1}^{n_k} f(y_{kj} \mid w_k = \ell_3) \tag{4}$$

and

$$f(y_{kj} \mid w_k = \ell_3) = \sum_{\ell_2 = 1}^{L_2} P(x_{kj} = \ell_2 \mid w_k = \ell_3) \prod_{i=1}^{I} f(y_{kji} \mid x_{kj} = \ell_2, w_k = \ell_3). \tag{5}$$

Equation (4) shows how the responses of the $n_k$ individuals belonging to group k are
linked to obtain the density for the full response vector of group k, $f(y_k)$. More
precisely, it shows that the group members’ responses are assumed to be mutually
independent conditional on the group-level class membership. Furthermore, from (5) it
can be seen that both the lower-level mixture proportions - $P(x_{kj} = \ell_2 \mid w_k = \ell_3)$ -
and the parameters defining the response densities - $f(y_{kji} \mid x_{kj} = \ell_2, w_k = \ell_3)$ -
may differ across higher-level mixture components.

Two interesting special cases of the multilevel LC model are obtained by
constraining the terms appearing in (5) (Vermunt, 2004, 2008). The first special case,
which is the one we will use in our simulation study, is a model in which the
individual-level class membership probabilities differ across group-level classes, but
in which the parameters defining the conditional distributions for the response
variables do not vary across group-level classes. The latter implies that
$f(y_{kji} \mid x_{kj} = \ell_2, w_k = \ell_3) = f(y_{kji} \mid x_{kj} = \ell_2)$. The second special case is a model
in which the parameters defining the conditional distributions for the response
variables differ across group-level classes, but in which individual-level class
membership probabilities do not vary across group-level classes. The latter restriction
implies that $P(x_{kj} = \ell_2 \mid w_k = \ell_3) = P(x_{kj} = \ell_2)$. The first special case is the most
natural specification if one uses the multilevel LC model as a multiple-group LC model
for a large number of groups. The second one is more similar to three-level
random-effects regression analysis.
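
For binary responses, equations (4) and (5) under the first special case can be evaluated directly, as in the sketch below. This is an illustration only: the function name and data layout are assumptions, and the parameter values in the usage lines are those of the two-class condition of the simulation design described in Sect. 3.

```python
from itertools import product

def group_density(responses, p_w, p_x_given_w, item_prob):
    """Density f(y_k) of one group under the first special case.

    responses: list of 0/1 response vectors, one per group member
    p_w[l3]: probability of group-level class l3
    p_x_given_w[l3][l2]: probability of individual-level class l2 given l3
    item_prob[l2][i]: probability of a positive response to item i in class l2
    """
    total = 0.0
    for l3, pw in enumerate(p_w):
        group_term = pw
        for y in responses:                      # product over group members, Eq. (4)
            f_individual = 0.0
            for l2, px in enumerate(p_x_given_w[l3]):   # sum over classes, Eq. (5)
                f_resp = px
                for i, y_i in enumerate(y):
                    p = item_prob[l2][i]
                    f_resp *= p if y_i == 1 else 1.0 - p
                f_individual += f_resp
            group_term *= f_individual
        total += group_term
    return total

# Two-class condition of the simulation design (Sect. 3).
P_W = [0.5, 0.5]
P_X_GIVEN_W = [[0.2, 0.2, 0.6], [0.4, 0.4, 0.2]]
ITEM_PROB = [[0.8] * 6, [0.8, 0.8, 0.8, 0.2, 0.2, 0.2], [0.2] * 6]

# Sanity check: over all response patterns of a one-person group the density sums to 1.
total_mass = sum(group_density([list(y)], P_W, P_X_GIVEN_W, ITEM_PROB)
                 for y in product([0, 1], repeat=6))
```

For realistic group sizes the product over members becomes very small, so an implementation meant for estimation would typically work with logarithms instead.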

The unknown parameters of a multilevel LC model can be estimated by means
of Maximum Likelihood (ML). For this purpose one can use the
Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) - the most popular
algorithm for obtaining ML estimates in the context of mixture modeling - which
in the context of a multilevel LC model requires a specific implementation of the E
step. As shown by Vermunt (2003, 2007), the relevant marginal posterior
probabilities can be computed in an efficient way by making use of the conditional
independence assumptions implied by the multilevel LC model. This special
version of the EM algorithm, as well as a Newton-Raphson algorithm with analytic
first-order derivatives and numerical second-order derivatives, are implemented in
the Latent GOLD software package (Vermunt & Magidson, 2008). The latest version
of the Latent GOLD software package was used to carry out the simulation
study reported below.



3 Design of the Simulation Study

The purpose of the simulation study was to compare the performance of different
model selection indices for determining the number of mixture components at the
higher level in the multilevel LC model. These indices are BIC, AIC, AIC3, CAIC,
ICOMP, and the validation log-likelihood. For BIC and CAIC we use two versions,
one with the number of groups and one with the total number of individuals as the
sample size.

Because we focus on detecting the correct number of group-level classes rather
than on detecting the correct number of individual-level classes, we decided to keep
the LC structure at the individual level fixed in our simulation design. More
specifically, we used a three-class model ($L_2 = 3$) for six binary responses (I = 6). The
class-specific “positive” response probabilities - $P(y_{kji} = 1 \mid x_{kj} = \ell_2)$ - for the
six items were set to {0.8, 0.8, 0.8, 0.8, 0.8, 0.8}, {0.8, 0.8, 0.8, 0.2, 0.2, 0.2}, and
{0.2, 0.2, 0.2, 0.2, 0.2, 0.2} for LCs 1, 2, and 3, respectively. So LC 1 has a high
probability of giving the positive response for all items, LC 3 a low probability for
all items, and LC 2 a low probability for 3 items and a high probability for the other
3 items. Our choice of number of items, number of classes, and response
probabilities is such that we obtain a condition with moderately separated classes. To give an
impression of the level of separation, our setting corresponds to an entropy-based
R-squared - a measure indicating how well one can predict the class memberships
based on the observed responses - of about 0.63. By using moderately separated
classes at the lower level, we make sure that detection of the group-level classes is
made neither too easy nor too difficult as far as this part of the model is concerned.

So far we have discussed the factors that were fixed in the simulation study. The
three factors which were varied are the lower-level sample size, the higher-level
sample size, and the number of LCs at the higher level. Previous simulation studies
have shown that the sample size, the number of classes, and the level of
separation between the classes are the most important factors affecting the performance
of model selection measures in the context of mixture models (Dias, 2006). It should
be noted that the separation between the higher-level classes can be manipulated in
several ways; that is, by increasing the level of separation of the lower-level classes,
by increasing the number of individuals per group (the lower-level sample size $n_k$),
and by making the $P(x_{kj} \mid w_k)$ more different across values of $w_k$. We used only the
lower-level sample size $n_k$ to manipulate the level of separation. More specifically,
by using $n_k$ = 5, 10, 15, 20 and 30 for the number of lower-level units per
higher-level unit, we created conditions ranging from very low to very high
separation. The corresponding entropy-based R-squared values are given below after
discussing the other design factors.

The other two factors that were varied are the higher-level sample size, for which
we used K = 50 and 500, and the number of classes at the higher level, for which
we used $L_3$ = 2 and 3. In the condition with two higher-level classes, the model
probabilities were set to $P(w_k = \{1, 2\}) = \{0.5, 0.5\}$, $P(x_{kj} = \{1, 2, 3\} \mid w_k = 1) =
\{0.2, 0.2, 0.6\}$, and $P(x_{kj} = \{1, 2, 3\} \mid w_k = 2) = \{0.4, 0.4, 0.2\}$. These
probabilities are such that the two LCs are moderately distinguishable. The condition
with three LCs at the higher level was created by splitting the above second class
into two new classes. For this condition, the model probabilities were $P(w_k =
\{1, 2, 3\}) = \{0.5, 0.25, 0.25\}$, $P(x_{kj} = \{1, 2, 3\} \mid w_k = 1) = \{0.2, 0.2, 0.6\}$,
$P(x_{kj} = \{1, 2, 3\} \mid w_k = 2) = \{0.2, 0.6, 0.2\}$, and $P(x_{kj} = \{1, 2, 3\} \mid w_k = 3) =
\{0.6, 0.2, 0.2\}$. Also here we have moderately different group-level classes. The five
different $n_k$ settings yielded entropy-based R-squared values of 0.35, 0.57, 0.71,
0.80, and 0.90 for the 2 class condition, and 0.36, 0.58, 0.73, 0.82, and 0.92 for the
3 class condition. This shows that in our settings separation was very much affected
by $n_k$ but not so much by $L_3$.

In total the simulation study design contained 5 x 2 x 2 = 20 cells, which are
all possible combinations of the three design factors. For each of these cells we
generated 100 data sets. With each data set we estimated multilevel LC models with
a fixed number of LCs at the lower level ($L_2 = 3$) and with different numbers of
LCs at the higher level.
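
The data-generating process for one cell of this design can be sketched as follows. The code is illustrative and is not the procedure run in Latent GOLD: the function name and the data layout are assumptions, while the probabilities are those stated above for the condition with two higher-level classes.

```python
import random

ITEM_PROB = [[0.8] * 6, [0.8, 0.8, 0.8, 0.2, 0.2, 0.2], [0.2] * 6]
P_W = [0.5, 0.5]
P_X_GIVEN_W = [[0.2, 0.2, 0.6], [0.4, 0.4, 0.2]]

def simulate_data_set(K, n_k, p_w, p_x_given_w, item_prob, seed=None):
    """Draw K groups of n_k individuals with binary item responses.

    Returns a list of (group_class, responses) pairs, where responses
    holds one 0/1 vector per individual in the group.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(K):
        w = rng.choices(range(len(p_w)), weights=p_w)[0]          # group-level class
        group = []
        for _ in range(n_k):
            x = rng.choices(range(len(item_prob)),
                            weights=p_x_given_w[w])[0]            # individual-level class
            group.append([1 if rng.random() < p else 0 for p in item_prob[x]])
        data.append((w, group))
    return data

# All cells of the design: 5 group sizes x 2 numbers of groups x 2 numbers of classes.
cells = [(n_k, K, L3) for n_k in (5, 10, 15, 20, 30) for K in (50, 500) for L3 in (2, 3)]
data = simulate_data_set(K=50, n_k=5, p_w=P_W, p_x_given_w=P_X_GIVEN_W,
                         item_prob=ITEM_PROB, seed=1)
```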



4 Results of the Simulation Study 

As was indicated above, the main goal of the simulation study was to determine
which of the investigated model selection measures is preferable for deciding
on the number of higher-level mixture components in multilevel mixture models.
For BIC and CAIC, which both have the sample size in their formula, we used two
versions, one based on the number of higher-level observations (K) and one based
on the total number of lower-level observations ($K n_k$).

Table 1 reports the results of our simulation study per design factor aggregated
over the other two design factors. For each level of the three design factors and for
each investigated fit measure, we indicate the number of simulation replications in
which the true number of higher-level latent classes was underestimated ($\hat{L}_3 < L_3$),
estimated correctly ($\hat{L}_3 = L_3$), and overestimated ($\hat{L}_3 > L_3$).

Let us first have a look at the results for BIC and CAIC using the two different
definitions for the sample size. From the results in Table 1, one can easily see that
for both BIC and CAIC, using the number of groups as sample size is the best option.
Underestimation of the number of mixture components is much more likely with
BIC($K n_k$) or CAIC($K n_k$) than with BIC(K) or CAIC(K). This is especially true
for the conditions corresponding to low or moderate levels of separation (small or
middle $n_k$ values), as well as for the smaller higher-level sample size.

Comparison of the results of all eight investigated fit measures shows that overall
AIC3 performs best. The results for BIC(K), CAIC(K), and ICOMP are similar in the
sense that they perform best when the number of individuals per group (the level of
separation) is large enough ($n_k > 15$). AIC, on the other hand, performs best when
separation is weak ($n_k = 5$) and when the sample size is small. As was found in
other studies, AIC3 seems to provide a compromise between these two sets of
measures (Dias, 2006). In contrast to our expectations, the validation log-likelihood did
not perform very well: it tends to overestimate the number of mixture components
under all conditions.

Table 1 The number of simulation replicates in which the investigated fit measure
underestimated ($\hat{L}_3 < L_3$), correctly estimated ($\hat{L}_3 = L_3$), and overestimated
($\hat{L}_3 > L_3$) the number of group-level mixture components, for each level of the
three design factors

                                       n_k                     K            L_3
Fit measure      Outcome       5    10    15    20    30     50   500      2     3    Total
BIC(K n_k)       under       233   131    67    18     1    400    50    136   314      450
                 correct     167   269   333   382   399    600   950    864   686    1,550
                 over          0     0     0     0     0      0     0      0     0        0
BIC(K)           under       199    83    26     6     0    286    28     93   221      314
                 correct     201   317   374   394   399    713   972    907   778    1,685
                 over          0     0     0     0     1      1     0      0     1        1
CAIC(K n_k)      under       253   146    81    33     5    456    62    153   365      518
                 correct     147   254   319   367   395    544   938    847   635    1,482
                 over          0     0     0     0     0      0     0      0     0        0
CAIC(K)          under       228   101    46     9     0    337    47    114   270      384
                 correct     172   299   354   391   400    663   953    886   730    1,616
                 over          0     0     0     0     0      0     0      0     0        0
AIC              under       103    45     9     3     0    158     5     41   122      163
                 correct     278   320   344   349   355    766   880    853   793    1,646
                 over         16    35    47    48    45     76   115    106    85      191
AIC3             under       155    68    13     5     0    236     5     70   171      241
                 correct     245   323   375   389   385    745   972    904   813    1,717
                 over          0     9    12     6    15     19    23     26    16       42
ICOMP            under       208    85    20     4     0    274    43     91   226      317
                 correct     191   310   380   392   398    714   957    900   771    1,671
                 over          1     5     0     4     2     12     0      9     3       12
Validation       under        78    37     9     1     0    121     4     46    79      125
log-likelihood   correct     215   239   272   286   291    582   721    691   612    1,303
                 over        107   124   119   113   109    297   275    263   309      572

5 Conclusions 

Based on the simulation study we can draw two important conclusions. The first
concerns the preferred sample size definition in the BIC and CAIC formulae.
Our results show clearly that it is much better to use the number of higher-level
units as the sample size instead of the total number of lower-level units. Using
the latter makes it much more likely that the number of mixture components
is underestimated, especially if the separation between components is weak or
moderate.

The second set of conclusions concerns the comparison of all investigated
measures. These results are very much in agreement with what is known from simulation
studies on standard mixture models. BIC, CAIC, and ICOMP perform very well
when the level of separation and the sample size are large enough. In contrast, AIC
seems to be the preferable method when the sample size is small and when the
level of separation is low. AIC3 offers a good compromise between the tendency of
BIC, CAIC, and ICOMP to underestimate the number of mixture components with
low separation and small sample sizes and the tendency of AIC to overestimate the
number of mixture components with higher separation and large sample sizes.

As in any simulation study, we had to make various choices which limit the 
range of our conclusions. First of all, we concentrated on selecting the number of 
classes at the higher level assuming that the number of classes at the lower level 
is known. Further research is needed to determine whether the same conclusions 
apply for selecting the number of lower-level classes, or for selecting simultane- 
ously the number of lower- and higher-level classes. Second, we used a classical 
LC model for binary responses whereas multilevel mixture models can also be used 
with other types of response variables. Finally, we concentrated on the variant of the 
multilevel LC model in which only the lower-level class proportions differ across 
higher-level classes. As was shown when introducing the model, other multilevel 
LC models may assume that response variables are directly related to the group-level 
class membership. It would be useful to replicate our simulation study for 
other types of multilevel mixture models, as well as for response variables of other 
scale types. 

Testing Mixed Distributions when the Mixing 
Distribution Is Known 



Denys Pommeret 



Abstract In this paper we present smooth goodness-of-fit tests for testing the 
mixture distribution of a sequence of i.i.d. random variables. We consider mixture 
models when the mixing distribution is known. We adapt the Schwarz criterion 
approach initiated by Ledwina (J Am Stat Assoc 89:1000-1005, 1994) and inspired by the 
Neyman (Skandinavisk Aktuarietidskrift 20:149-199, 1937) smooth test procedure. A 
Monte Carlo study is provided in order to assess the performance of the test. 

Keywords Mixture models · Neyman's test · Score statistic · Schwarz's criterion. 



1 Introduction 

Mixture models have been widely studied in the last two decades, and a common 
problem is testing the mixing distribution when the mixed distribution is 
known. In particular, testing for the presence of a mixture, that is, testing a single 
distribution against a mixture, has been widely addressed in the literature (see recently 
Pons, 2008; Garel, 2005; Azais, Gassiat, & Mercadier, 2008). 

In this paper, the converse situation is considered: the mixing distribution is 
assumed to be known and the mixed distribution is then tested. To this end we consider 
an i.i.d. sample $X_1, \ldots, X_n$ with mixture density function 

$$g(x) = \int f(x, m)\, \Pi(dm), \qquad (1)$$

where $\Pi$ is a real probability distribution and $f(x, m)$ are density functions 
parameterized by $m$, for $m$ in some set $M \subset \mathbb{R}$. We assume that the 
mixing distribution $\Pi$ is known and we want to test 

$$H_0 : f(x, m) = f_0(x, m), \quad \text{for all } m \text{ in } M, \qquad (2)$$

where $f_0$ is a specified probability density function. 

This situation arises with contaminated variables, for instance $X = W + Z$, 
where $W$ is a known signal and $Z$ is a noise with unknown distribution. We may 
want to test whether the error $Z$ is normally distributed, which coincides with a normal 
mixture distribution satisfying (1) with normal mixed density. Contaminated 
variables may also have the form of a scale mixture: $X = WZ$. In the same way, within 
the framework of compound distributions we have $X = W_1 + \cdots + W_N$, where $N$ 
is a positive discrete random variable with known distribution $\Pi$. We may wish 
to test whether the random variables $W_i$ have the same distribution, for instance a 
Poisson distribution. 

For such situations our purpose is to construct a goodness-of-fit test based on 
the idea of Neyman's smooth test (Neyman, 1937). The basic idea consists in 
transforming the nonparametric hypothesis $H_0 : f = f_0$ by embedding $f$ in a 
parametric family 

$$f_\theta(x) = \alpha(\theta) \exp\Big\{ \sum_{j=1}^{k} \theta_j Q_j(x) \Big\} f_0(x), \qquad (3)$$

where $\{Q_j;\ j = 1, \ldots, k\}$ is a family of $f_0$-orthogonal polynomials, 
$\theta = (\theta_1, \ldots, \theta_k)^T$ is an unknown parameter in an open set containing 
zero and $\alpha(\theta)$ is a normalization parameter. The parameter $k$ coincides with the 
number of components that we will discuss. Thus the null hypothesis is equivalent to 
$H_0 : \theta = 0$, which suggests the use of the score statistic. To work with bounded 
variables it is usual to transform the data by putting $T_i = F_0(X_i)$ for $i = 1, \ldots, n$, 
where $F_0$ denotes the distribution function under $H_0$. Then the null hypothesis is 
equivalent to the uniformity of the random variables $T_1, \ldots, T_n$. In that case, the 
associated orthogonal polynomials are the Legendre polynomials and the score statistic 
has the form 

$$T_k = \sum_{j=1}^{k} \Big\{ n^{-1/2} \sum_{i=1}^{n} Q_j(T_i) \Big\}^2,$$

with asymptotic chi-squared distribution under $H_0$. 

This situation has been studied in Neyman (1937) (see also Rayner & Best, 1989 
for a complete review). In 1994, Ledwina proposed the use of the Schwarz criterion to 
select the number $k$ of components in the score statistic. This author considered the 
statistic $T_U$, where $U = \arg\max_{k \le K} (T_k - k \log(n))$, with $K$ bounded. Since then, 
a vast literature on data-driven Neyman tests has appeared. In particular Kallenberg 
and Ledwina (1995) considered the case of an unbounded number of components $K$. 

The aim of this paper is to adapt the Neyman smooth test and its data-driven 
extension to the case of mixture models. However, Neyman's test requires either 
the knowledge of the probability distribution function under $H_0$, or the knowledge 
of orthonormal functions with respect to the probability density function under $H_0$, 
and in general this information is not available in the mixture context. To carry out 
the construction of the test we introduce a known reference probability measure 
$\mu$ such that $g$ and $f(\cdot, m)$, in (1), are density functions with respect to $\mu$. This 
hypothesis only requires the knowledge of the support of the sample $X_1, \ldots, X_n$. 
Next we consider a basis of $\mu$-orthogonal polynomials $\{Q_j;\ j = 1, 2, \ldots\}$. Writing 
$g_0$ for the probability density function under $H_0$, the equality (3) yields 

$$g_\theta(x) = \alpha(\theta) \exp\Big\{ \sum_{j=1}^{k} \theta_j Q_j(x) \Big\} g_0(x).$$

Again, the null hypothesis may be rewritten as $H_0 : \theta = 0$ and we can adapt the 
score statistic, taking into account the non-orthogonality of the polynomials with 
respect to the mixture density $g_0$. In the same way we adapt the test statistics 
initiated by Ledwina. We study their distributional properties under $H_0$ and then we 
compare their performances through simulations. 

The paper is organized as follows: In Sect. 2 we proceed with the construc- 
tion of the tests. In Sect. 3, critical values and powers are studied by Monte Carlo 
simulations. 



2 Construction of the Test Statistics 

Let $X_1, \ldots, X_n$ be i.i.d. random variables with density function $g$ satisfying (1). If 
the mixing distribution $\Pi$ is known we want to test (2) against $f(x, m) \ne f_0(x, m)$, 
where $f_0(x, m)$ are densities with respect to a given probability measure $\mu$. 



2.1 The Score Statistics T(k) 

Let $Z(k) = (Z_1, \ldots, Z_k)^T$ be the random vector with components 

$$Z_j = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( Q_j(X_i) - E_0(Q_j(X)) \big),$$

where $X$ is a random variable with probability density function $g = g_0$ satisfying 
$H_0$. Here, $E_0$ denotes the expectation under $H_0$. The adapted score statistic we 
propose is 

$$T(k) = Z(k)^T \Sigma(k)^{-1} Z(k),$$

where $\Sigma(k)$ is the $k \times k$ covariance matrix of $Z(k)$ with components 

$$\Sigma_{ij}(k) = E_0(Q_i(X) Q_j(X)) - E_0(Q_i(X)) E_0(Q_j(X)).$$

We will denote by $\lambda(1) \ge \lambda(2) \ge \cdots \ge \lambda(k)$ the sequence of ordered 
eigenvalues of $\Sigma(k)$. We first state a basic result. 

Proposition 1. Let $X$ be a random variable with mixture density function $g$ and 
suppose that $X$ has finite moments of order $2k$, for some integer $k$. Assume that 
$H_0$ holds. Then the distribution of $T(k)$ is asymptotically (when $n$ tends to infinity) 
central chi-squared with $k$ degrees of freedom. 

Proof. Based on the Central Limit Theorem. 
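
The quadratic form $T(k)$ can be illustrated numerically. The sketch below is not the author's implementation: it assumes a Poisson mixed density with an exponential mixing distribution of mean 1 (the setting of Model 2 in Sect. 3), for which the null law is $P(X = x) = 2^{-(x+1)}$, and it uses the monomials $x^j$ in place of the orthogonal polynomials $Q_j$. This substitution is a simplifying assumption; it leaves $T(k)$ unchanged because the statistic is invariant under an invertible linear change of polynomial basis once the components are centred. The function names are hypothetical.

```python
import numpy as np

def null_raw_moments(support, probs, max_order):
    """E0[X^j] for j = 0..max_order, for a discrete null distribution."""
    return np.array([np.sum(support ** j * probs) for j in range(max_order + 1)])

def score_statistic(x, k, moments):
    """T(k) = Z(k)^T Sigma(k)^{-1} Z(k), using Q_j(x) = x^j for j = 1..k."""
    x = np.asarray(x, dtype=float)
    orders = np.arange(1, k + 1)
    mean = moments[orders]
    # Sigma_ij = E0[X^(i+j)] - E0[X^i] E0[X^j]
    sigma = moments[orders[:, None] + orders[None, :]] - np.outer(mean, mean)
    z = (x[:, None] ** orders - mean).sum(axis=0) / np.sqrt(x.size)
    return float(z @ np.linalg.solve(sigma, z))

# Null model: P(X = x) = 2^-(x+1) on x = 0, 1, 2, ... (support truncated at 200).
support = np.arange(0, 200, dtype=float)
probs = 0.5 ** (support + 1.0)
k = 2
moments = null_raw_moments(support, probs, 2 * k)

rng = np.random.default_rng(0)

def draw_sample(n):
    """Poisson(m) observations with m drawn from an exponential law of mean 1."""
    return rng.poisson(rng.exponential(1.0, size=n))

stats = np.array([score_statistic(draw_sample(100), k, moments) for _ in range(2000)])
# Under H0 the covariance of Z(k) equals Sigma(k), so E0[T(k)] = k for every n.
mean_stat = stats.mean()
```

The average of the simulated statistics should be close to $k$, the mean of the limiting chi-squared distribution in Proposition 1.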

To study the consistency of the test based on $T(k)$ we consider the following 
class of alternatives: random variables $Y$ whose first $k$ moments satisfy 

$$E(Q_t(Y)) = E(Q_t(X)), \ \text{for } t = 1, \ldots, k-1, \qquad E(Q_k(Y)) \ne E(Q_k(X)),$$

for some integer $k > 1$. 

Proposition 2. The test based on $T(k)$ is consistent against any alternative 
satisfying the moment conditions stated above. 

Proof. Based on the Law of Large Numbers. 

Thus, if we let $k$ increase we can obtain consistency against all distributions 
determined by their moments. We consider this problem below, by letting $k$ tend 
to infinity with the sample size $n$. 



2.2 Schwarz Criteria Statistics 

We discuss a data-driven method introduced in Ledwina (1994) for selecting the 
number of components $k$ in $T(k)$. The author used Schwarz's selection rule 
for choosing among exponential families corresponding to successive dimensions in (3). 
The method consists in maximizing the penalized log-likelihood of the i.i.d. sample 
and we can adapt it to our context of mixture distributions. Since the likelihood 
of the mixture defined by (1) is not easy to express, we replace it by its first-order 
approximation, which corresponds to the score statistic. Thus we consider 

$$U_K = \arg\max\{T(j) - j \log(n);\ j = 1, \ldots, K\}.$$
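
The selection rule above is simple to state in code. The following sketch takes already computed values $T(1), \ldots, T(K)$ and returns the selected order; the function name and the numerical values are hypothetical and only illustrate how the penalty $j \log(n)$ acts.

```python
import math

def select_order(score_stats, n):
    """Return U = argmax_j {T(j) - j*log(n)} (1-based), for score_stats = [T(1), ..., T(K)]."""
    penalized = [t - j * math.log(n) for j, t in enumerate(score_stats, start=1)]
    # max() returns the first maximizer, so ties are resolved towards the smaller order.
    return 1 + max(range(len(penalized)), key=penalized.__getitem__)

# Hypothetical values of T(1), ..., T(4) for a sample of size n = 100.
under_null = select_order([1.2, 2.9, 3.4, 5.0], n=100)
under_alternative = select_order([1.2, 14.8, 15.3, 16.0], n=100)
```

In the first case the increments of $T(j)$ never exceed $\log(n)$, so the rule selects order 1; in the second case the jump at $j = 2$ outweighs the penalty and order 2 is selected.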

If we let $K = k_n$ increase with the sample size $n$ we obtain a general result, 
under a condition on the rate at which $k_n$ grows. 

Proposition 3. Assume that $k_n \log(k_n) / \lambda(k_n) = o(\log n)$. Then under $H_0$, the 
statistic $U_{k_n}$ converges in probability to 1 and the statistic $T(U_{k_n})$ is asymptotically 
chi-squared distributed with one degree of freedom. 

Proof. For simplicity we write $U$ instead of $U_{k_n}$. Under $H_0$, $T(1)$ converges to a 
chi-squared random variable with one degree of freedom, so it is enough to show 
that $P(U \ge 2)$ tends to zero as $n$ tends to infinity. By definition of $U$, we have 

$$P(U \ge 2) = \sum_{k=2}^{k_n} P(U = k) \le \sum_{k=2}^{k_n} P\big( T(k)^{1/2} \ge \sqrt{(k-1) \log n} \big). \qquad (4)$$

Using successively the Markov and Cauchy-Schwarz inequalities, we get 

$$P\big( T(k)^{1/2} \ge \sqrt{(k-1) \log n} \big) \le \frac{E_0(T(k)^{1/2})}{\sqrt{(k-1) \log n}} \le \left( \frac{E_0(\|\Sigma(k)^{-1/2}\|^2)\, E_0(\|Z(k)\|^2)}{(k-1) \log n} \right)^{1/2}. \qquad (5)$$

As previously, we will not distinguish matrix norms and vector norms. Let us bound 
the right-hand side of (5). We have 

$$E_0(\|Z(k)\|^2) = \frac{1}{n} E_0\Big( \sum_{j=1}^{k} \Big( \sum_{i=1}^{n} \big( Q_j(X_i) - E_0(Q_j(X)) \big) \Big)^2 \Big) = \mathrm{Tr}(\Sigma(k)) \le B < +\infty,$$

where $B$ is independent of $k$ since $\Sigma(k)$ is a bounded operator. On the other hand, 
we have 

$$E_0(\|\Sigma(k)^{-1/2}\|^2) = \|\Sigma(k)^{-1/2}\|^2 = \lambda(k)^{-1}.$$

Combining these last equalities we obtain 

$$P(U \ge 2) \le \sum_{k=2}^{k_n} \left( \frac{B}{(k-1) \lambda(k) \log n} \right)^{1/2}.$$

Since the matrices $\Sigma(k)$ are nested we can deduce that 

$$P(U \ge 2) \le \left( \frac{B}{\lambda(k_n) \log n} \right)^{1/2} \sum_{k=2}^{k_n} \frac{1}{\sqrt{k-1}},$$

and by the Cauchy-Schwarz inequality 

$$P(U \ge 2) \le \left( \frac{B}{\lambda(k_n) \log n} \right)^{1/2} \Big( \sum_{k=1}^{k_n - 1} \frac{1}{k} \Big)^{1/2} (k_n - 1)^{1/2} = \left( \frac{B}{\lambda(k_n) \log n} \right)^{1/2} H(k_n - 1)^{1/2} \sqrt{k_n - 1},$$

where $H(k_n - 1)$ is a harmonic sum. From Young (1991) we have 
$H(k_n - 1) \le \{2(k_n - 1)\}^{-1} + \log(k_n - 1) + \gamma$, where $\gamma$ is the Euler constant, 
and then $P(U \ge 2)$ tends to zero when $n$ tends to infinity. 

Remark. Since $\Sigma(k)$ is a nuclear operator, $\lambda(k)$ tends to zero as $k$ tends to infinity. 
Conditions on $\lambda(k)$ in Proposition 3 may be compared with conditions given in the 
literature, as in Cardot, Ferraty, and Sarda (2003) where k‘^X{k) oo when k tends 
to infinity. It may also be assumed that $\lambda(k) \sim a r^k$, with $0 < r < 1$ and $a > 0$. 
Note that $\lambda(k) \sim a k^{-\gamma}$, with $\gamma > 1$, is satisfied for the covariance operator of the 
Brownian motion (see Ash & Gardner, 1975). 



3 Simulation Study 



The purpose of this section is to evaluate the performance of the test. All empirical 
values are based on 10,000 samples with sizes $n = 20, 30, 50, 100$, and the nominal 
level of the test is $\alpha = 5\%$. 

Throughout this section we consider two mixture densities corresponding to the 
following models: 

Model 1: $g(x) = \int f(x, m)\, \Pi(dm)$, where $f(x, m)$ is the density of the normal 
distribution with mean $m$ and variance 1, and $\Pi$ is an exponential distribution with mean 1. 
Model 2: $g(x) = \int f(x, m)\, \Pi(dm)$, where $f(x, m)$ is the density of the Poisson 
distribution with mean $m$, and $\Pi$ is an exponential distribution with mean 1. 

Note that Model 2 coincides with a negative binomial distribution with parameters 
1 and 1/2, so we will be able to compare our test statistic with that of 
Pearson. 
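
Both models are easy to simulate by first drawing the mixing variable and then the mixed variable. The sketch below is only an illustration (function names and the seed are arbitrary); it also checks empirically that Model 2 has probabilities $2^{-(x+1)}$, that is, the negative binomial law with parameters 1 and 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_model1(n):
    """Normal(m, 1) mixed density with m drawn from an exponential law of mean 1."""
    m = rng.exponential(1.0, size=n)
    return rng.normal(loc=m, scale=1.0)

def sample_model2(n):
    """Poisson(m) mixed density with m drawn from an exponential law of mean 1."""
    m = rng.exponential(1.0, size=n)
    return rng.poisson(m)

x1 = sample_model1(100_000)
x2 = sample_model2(100_000)

# Model 1: E(X) = 1 and Var(X) = Var(m) + 1 = 2.
# Model 2: P(X = x) = 2^-(x+1) for x = 0, 1, 2, ...
freq = np.bincount(x2, minlength=4)[:4] / x2.size
expected = 0.5 ** (np.arange(4) + 1)
```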

We first need the following lemma to express the variance matrix $\Sigma(k)$ and then 
the score statistic $T(k)$ for Models 1 and 2. 

Lemma 1. 1. Let {fi.,m);m e K} be the set of normal density with mean 
parametrization m and with common variance a'^ and let the reference measure 
p. be normal with mean 0 and variance The p-orthogonal polynomials are 
Hermite ones (see Abramowitz & Stegun, 1972) with normalization 
f Qjix)^Tidx) = ■ Then we have 



^QiiX)) 



nQi(x)Qjix)) 



j (mYin 

= ^ / 

i-j^s<i+j 



ii\)-^^^U{dm)a'', 



(jfiY 

X^n{dm)E{Q,iY)Qi{Y)QjiY)), 



where $Y$ is $\mu$-distributed. 

2. Let {fi.,m)\m > 0} be the set of Poisson density with mean parametriza- 
tion m and let the reference measure p be Poisson with mean mo. The p- 
orthogonal polynomials are Charlier ones ( see Abramowitz & Stegun, 1972 ) with 
normalization ^ Q j (x)^ p(x) = Then we have 
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nQi{X)Qj{X)) = Y^^^^^^^n(dm)nQs{Y)Qi{Y)Qj(Y)), 



where $Y$ is $\mu$-distributed. 



3.1 Empirical Levels 

First simulations with $T(U)$ showed that for small sample sizes the percentage of 
rejections of the null hypothesis is larger than 5%. Following Kallenberg and Ledwina 
(1995) we use a second-order approximation to the null distribution to get 
critical values close to the 5% point. We get: 



For $n = 20, 30, 50, 100$, inverting $A(x)$ we obtain the critical values 
5.03, 5.26, 5.36, 5.19, respectively, which are not close to the 5% point of the 
chi-squared distribution with one degree of freedom, namely 3.84. We use these simulated 
critical values in our simulations. 

We consider tests based on T(2), T(3), T(4) and T(Us). The results of the 
simulations are summarized in Figs. 1 and 2, based on 10,000 simulations. Clearly, 
for sample sizes greater than 30 all testing procedures have a significance level of 
approximately 5%. For $n = 20$ the empirical level of the statistic T(Us) is too high. 
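
An empirical level is the proportion of rejections over samples simulated under the null hypothesis. The sketch below illustrates the computation in the simplest case only: Model 1 with the fixed-order statistic $T(1)$ and the asymptotic chi-squared critical value, not the data-driven statistic or the corrected critical values discussed above. It assumes $Q_1(x) = x$, for which $E_0(X) = 1$ and $\mathrm{Var}_0(X) = 2$ under Model 1; names and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
CHI2_1DF_95 = 3.841  # 5% critical value of the chi-squared law with 1 d.f.

def t1_model1(x):
    """T(1) for Model 1 with Q_1(x) = x: n * (mean(x) - 1)^2 / 2."""
    return x.size * (x.mean() - 1.0) ** 2 / 2.0

def empirical_level(n, n_rep=10_000):
    """Proportion of null samples of size n for which T(1) exceeds the critical value."""
    rejections = 0
    for _ in range(n_rep):
        m = rng.exponential(1.0, size=n)
        x = rng.normal(loc=m, scale=1.0)
        rejections += int(t1_model1(x) > CHI2_1DF_95)
    return rejections / n_rep

levels = {n: empirical_level(n) for n in (20, 30, 50, 100)}
```

With 10,000 replications the Monte Carlo standard error of each level is about 0.2 percentage points, which gives a sense of the precision of the empirical levels reported in this section.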



P(T(U) < x) {2<p(s/x) - \}{2<p(s/\og{n))} -|- {2<p(-y/- log(«))} 




= A(x). 

Fig. 1 Empirical levels for Model 1 ($\alpha = 5\%$) 

Fig. 3 Empirical powers in percent for Model 1 (normal mixed distribution) against Alternative 1 
(uniform mixed distribution) for $\alpha = 5\%$ 



3.2 Empirical Powers when the Mixed Density Is Known 

The proposed tests are performed at the 5% level of significance. We consider the 
following two alternatives: 

Alternative 1: $g(x) = \int f(x, m)\, \Pi(dm)$, where $f(x, m)$ is the density of the 
uniform distribution with mean $m$ and variance 1, and $\Pi$ is an exponential distribution 
with mean 1. 

Alternative 2: $g(x) = \int f(x, m)\, \Pi(dm)$, where $f(x, m)$ is the density of a 
geometric distribution with mean $m$, and $\Pi$ is an exponential distribution with mean 1. 

Figure 3 shows the estimated powers for Model 1 with Alternative 1 and Fig. 4 
shows the estimated powers for Model 2 with Alternative 2. 

Fig. 4 Empirical powers in percent for Model 2 (Poisson mixed distribution) against Alternative 
2 (geometric mixed distribution) for $\alpha = 5\%$ 



It seems that all the statistics have similarly good power, except T4 when the 
sample size is small, because too many polynomial terms then have to be estimated 
from few observations. The Pearson statistic is used with three or four degrees of 
freedom (according to the sample size). It is less powerful here than the other statistics. 

Classification with a Mixture Model Having 
an Increasing Number of Components 



Odile Pons 



Abstract This paper is concerned with estimation of the components and classification 
in semi-parametric mixture models with an increasing number of components as 
the sample size grows. Properties of the penalized maximum likelihood estimators 
are presented: consistency, rates of convergence and asymptotic normality, under 
additional assumptions. A random classification of the observations is based on the 
same criterion and some consistency properties are established. 

Keywords Asymptotic behaviour · Classification · Maximum likelihood · Mixture. 



1 Introduction 

Mixture models are often used when modeling heterogeneity within a population, 
where p distinct homogeneous sub-populations are described by sub-densities. The 
density of a random variable defined on the population is written as a weighted mean 
of the densities of the sub-populations. Inference in finite mixture models has been 
studied by many authors (see, e.g., Bickel & Chernoff, 1993; Fan & Peng, 2004; 
Lemdani & Pons, 1999 and the references therein). 

In this paper the mixture densities have a number p„ of components that grows 
with the sample size n at an appropriate rate. In a population of infinite and count- 
able size, an increasing number of sub-populations is sampled as the number of 
observations increases. As the mixture coefficients sum up to one, they are indexed 
by n. The problem of estimating and classifying within a countable set of densities 
is different from the usual finite classification and it appears in many applications 
(Pons, 2007; Pons & Petit, 1996). 

When available, a large sample makes it possible to differentiate between close 
sub-populations which a smaller sample may confound. The densities are then also 
indexed by n. However, our aim to detect as many components as possible has to be 
balanced by the need to avoid false detections. For that purpose we use penalization 
functions designed so that the coefficient estimators are not too small and the component 
density estimators are not too close to each other in the $L_2$-distance. The reader is referred 
to Fan and Peng (2004) and the references therein for more details on penalization 
in a general framework. In our approach we use similar regularity assumptions and 
extend some of these results to mixtures. 

Let , A. lP)pg 73 be a family of probability spaces and a true probability Pq in 
V and let {X, S) be a metric space. Consider a random variable 



Pn 

7=1 

with values in X having a density g„ under an unspecified probability P„ and go 
under Pq and a sample Xi, . . . , X„ of Z. The densities g„ of the model are written 



Pn 

Sn = ^ ( 1 ) 

7=1 

with a varying number of components p„. The mixture probabilities are 

/X„,y = Pr(ft) G A„j), f f„j{s)ds = Pr(Z„j < x), 

J—OO 

and go = ^j"=i dn j fn j- The Hnj’& are in [0, 1] and sum to 1 and the densi- 
ties fn.j are supposed to belong to a nonparametric space of probability densi- 
ties F. This space is assumed to satisfy the identifiability assumption D1 given 
below. The model is semi-parametric with scalar and functional parameters P„ = 

id'll, T d'U,p^ ft,U • • • J ft,p) ■ 

In what follows the n -sample is supposed to have the density g„ = gp^ with a 
slowly increasing number of components: p„ = o{n) as n tends to infinity. As the 
mean number of observations within the sub-populations also increases with n, the 
parameters may be suitably estimated. 

We first address the problem of estimating fin. Under (1), we consider the 
maximum likelihood estimators (MLE) of the and fi.j’s. Notations and 

assumptions are stated in Sect. 2. Choosing appropriate respective penalization func- 
tions we establish that the convergence rate of ||/x„|| is and it is related 

to the density estimation procedure for \\fn\\, the proofs are in Pons (2009). The 
dimension p„ is supposed to be large enough to describe the true model under Pq, 
with dimension r„ smaller than p„ . Under stronger conditions on the penalization 
functions, the non zero components completely disappear in probability and the 
parameters are asymptotically Gaussian. 

For any integer I, let us denote Si = {fi € [0, 1]^, dj — U for th® i^ ~ 
l)-dimensional simplex and Sf = Si 1]^ for subset of strictly positive 

components, Vi stands for the set of permutations of {1, . . . , £}. The notation || • || 
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is for the Euclidean norm and the L 2 -norm with respect to Lebesgue’s measure 
on F and other functional spaces when it is specified. Moreover, for any vector of 
functions v = (vi, . . . , vi)^ in L 2 (A’, IR) and the L 2 norm is ||v|p = ll^i IP- 
Finally, vector notations are extended in order to define the sum of two vectors 
of respective dimensions r„ and r„ + 5 ,, by adding s„ zeros to the first one when 
necessary. 



2 Main Notations and Assumptions 

Let be the actual value of fin and g° = gpo be the underlying density. 

We make the assumption that the model is never under-parameterized, which means 
that the number of components in $g_n^0$ with nonzero coefficients and pairwise distinct 
corresponding densities is less than or equal to the number $p_n$ of components of model (1) 
considered in the estimation procedure. Therefore a number $s_n$ of components, among $p_n$, 
may not appear in $g_n^0$, which is equivalent to having coefficients equal to zero. Then 
$p_n = r_n + s_n$ and we suppose that $r_n$ increases with $n$. Up to a permutation, we can 

write = (p.%, , 0 , ..., 0 )^ also denoted with 7 ^ 0 for 

I < j < r„ but cannot be identifiable under ( 1 ) as a 2 />„ -dimensional param- 
eter since there are no densities in F corresponding to its 0s„ sub-vector. Then the 
parameter is denoted , fn^Y with/® e F''". 

In order to estimate the component densities we suppose that a functional esti- 
mation procedure has been chosen (kernel-based, splines, . . . ). This means that any 
density / e F has an estimator f„ chosen within a space F„ and its expectation 
belongs to a sequence of spaces F„ C F such that 

limsupF,, = F. 



This sequence allows for approximating the density / by its projection in F„, 
at some rate (bias term) for c e ( 0 , 1 / 2 ] whereas the variance term (which 
amounts to a “distance” between F„ and F„) has the same rate 



{V) sup inf Wfn- f\\ = 0{n (2) 

/€F/n€F« 



sup inf 

fn 




Opo(«“^). 



(3) 



Tangent spaces associated to the ’s are defined for any 1 < / < r„ and 
/ G F \ {f^ j } by the derivatives 



h 



j.f = 



f - 

J Jn,j 



g 



0 

fl 



(4) 
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VjJ 



f - fn, 

hj.fg°„ 



For 1 < 7 < denote D„ jF = {'Pj.fl f ^ F}- Then we define the L 2 - 
differentiability of F at j by the existence of a linear map j e D„ j F such that 



limsup hj^Wigl) \f - llj)-hjj(p 



0 

n.j 






= 0 . 



We define covariance matrices, for v„,y e D„ jF, 1 <7 < r„, as 

F(vi, . . . , v,„) = Eo{(g°)-2(vi, . . . , 

where = xx'^ , and F° = F(/°, . . . , f°J. 

For any P„ e x F,^” the log-likelihood is 

n { Pn 1 

L„ (p„) = L„ {p„, , . . . , i^„) = ^ log j ^ IX, j f„j (W) I . 

Consider penalization functions jti : [ 0 , 1 ] — s- F+ and Jt2 '■ L2(A’,R) — > R+ 
associated to the coefficients ix„ j and f,.j, respectively, with jri(O) and :7r2(0) = 0. 
We suppose ji\ is twice continuously differentiable and Jt2 is twice continuously 
L2-differentiable, which means the finiteness of the L2(T’, R)-derived norms for 
n'2 and 712 . They are defined by 

Ik2(/)ll = sup \\n' 2 {f)h\\, ( 5 ) 

/i6Z.2(ip).I|/i|I=i 

Ik2(/)ll = sup \\}i^ Jt2{f)h\\, (6) 

/i€L2(ip),I|/!||=i 



for any / eF 2 (T’,R). 

Then the penalized log-likelihood is defined as 



Pn 

Qnifin) = Lnifin) ~ tl^l ^ Tti {lX„j) - nV^ ^ 7T2 {fn.J ~ fn.k) , 

7 = 1 ^<j<k<Pn 

where X„ and u„ are regularization parameters tending to zero as n tends to infinity. 

In what follows = {jiin, . . . , [Xp„„, fin, ■ ■ ■ , fp„„Y = (Al- //)^ denotes the 
penalized MLE of over Sp^ x F^' ■ 

In the conditions below, D refers to conditions on the density family and P is 
related to penalization functions. The rates in assumptions P vary according to the 
results below, with P' designating the stronger ones. 
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Condition 1 D\. Identifiability of the components: 

For every I > 1, (/^i, . . . , e Sf e Si, {f\, . . . , fi) and 

{hi , . . . , hi) € such that j ^ j' implies f„j ^ fj', if 

i i 

l^n.J fij = ^n.jhn.j , 

i=l i=\ 

then 

3cr ^ Fi, s.t. l^nj — ^cf(j) ^tid fyij — 
for every j, 1 < 7 < 1. 

D2. The density is uniformly bounded over n, 

E sup sup max ||g“*/n,7 < 00 

{/» = (/] /p„)€FP"} {j„=/rJ/„,/r„€5^„,||/r„-/rO||=o(l)} 

and 



lim sup sup l^n ‘ log (g°) * ^ /« J (^i) 



7=1 



Pn 



-£olog{(g°)-' ^/x„, 7 /„. 7 }(X)| = 0. 
7 = 1 



D3. The n observations X\, . . . ,X„ are Ltd. with density 

D4. r° is positive definite and the smallest eigenvalue £«(■) ofTfi) satisfies the 
uniform property 



lim inf inf s„{vi, . . . , v,) > 0 
te“)-i(vi,...,v,„)ew 



where U = £>iF x . . . x Dr„JP. 
PI. The sequences 



a 



n 



max 

!<7<r„ 




and A„ 



max 

i<j<k<r„ 




satisfy 

Xian = 0 and = O {p„n^‘^) . 

or P'\. For converging neighborhoods Vn{pl) of pt^ in Sp,, and yVnif^) of f° 
in F'^", the sequences 
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a' = sup max \p'^ (/x„,y) |, 

(/^l Atp„)€V„{M°) 

K= sup lUUX \\n' {f„j - fn.k) 

(/l,..../p„)€W„(/«)Xp» ^<J<k^Pr, 



satisfy 

= 0 and v^A'^ = 0 ip„n^") . 

P2. The sequences 



b 



n 



max 

l<;<rn 




and Bn 



max 




satisfy 

= o(l). 

or P'2. The sequences 



b'n = sup max \p" [Xj)\, 

(Ai....,Ap„)6F„(A0)'2>2?» 

K= sup max \\n"{fj-fk) 



satisfy 

^IK + vf,B'n = o{pf^^^). 

P3 The functions Jt'( and n'f are Lipschitz continuous and :7r(_|_(0) > 0. 

Remark 1. Parametric families are trivial examples of density spaces satisfying 
assumptions D1-D2 under relevant constraints on the parameter space and the 
^-i/2-rate is achieved for the optimal density estimators. On the other hand clas- 
sical nonparametric families of densities (unimodal, compact support, symmetric, 
bounded variation, . . . ) are larger but non identifiable in the sense of D1 . However 
the results are suitable for nonparametric families of densities with constraints of 
nonparametric shape. Estimators may then be defined by r„ -dimensional projections 
on a basis of functions and the best rate for an approximation of the true functions 
is fixed by {T>), with r„ — s- oo as n increases and r„ depending on c according to 
the basis. For given penalization functions Ji\ and ji 2 , conditions P1-P2 or P'l-P'2 
give the respective convergence rates of p„, X„ and v„ with respect to n and c. 



3 Convergence 

Our first result states the consistency and the weak convergence of the estimators of 
$\mu_n$ and $f_n$ that maximize the penalized log-likelihood; it is proved in Pons (2009). 
First, under D2, D3 and P1-P3, there exists a function K such that 
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- QniPn)] + K{Pn,Pt)\ ^ 0 . 

p„ 



Under D1 and by the concavity property of K, the maximum of Q„ is achieved in a 
neighborhood of with probability tending to 1. 

Theorem 1. Under {T>), D1-D4 and P1-P3, the penalized m.l. estimator = 
(A« ’ fl)^ maximizing n~^ {Qn(fi^) — QniPn)} exists and satisfies 

\\Pn - PnW = \\f„ - f°\\ = 

Remark 2. The conditions of Theorem 1 are satisfied for $\pi_1 = 0$ and $\pi_2 = 0$. 
Therefore the maximum likelihood estimators without penalization have the 
convergence rates of Theorem 1. 

The asymptotic distribution of the estimators is given under stronger conditions 
and additional notations. Let 1„ denote the vector of IR'" with all components equal 
to 1 and / the r„ x r„ identity matrix. A vector is said to be op„(l„) if all its com- 
ponents are op„(l) while a matrix is Opo(7) if all its columns are Opo(l„). Finally, 
two vectors are said to he asymptotically equivalent if their difference is 
For /(„) = (/i, . . . , fr„Y G F''" we denote ^ f„j, with 

and 



Vnifin)) = , frJXO f, 

i 

WAfHn), fKr,)) = 

i 

with /i(„) and / 2 („) e F. Then WnifT’ , f^) asymptotically equivalent to the 
matrix 

Wo.„ = . 

Theorem 2. Under Hypotheses {T>), D1-D4, P' 1, P'2, P3 and if p„ = o(n^) 

. . IPo 

®^o(Mrrt + 1.7i ? ' ' ' t^pn.n =0) > 1. 

71—^00 

The first r„ components satisfy 

\ / \<j <r,i 

+ OPo(l„) 



If moreover p„ = o{n‘^l^), we have 
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,1/2 



(/2n.y 

\ ■“ / l<] <r„ 



Wr 



-1 



{Ki/l 






4 Random Classification of the Observations 

The classification by a mixture model is the allocation to a component of the mixture 
or to a subset of its components, with a random procedure for the choice of a single 
component. The number $r_n$ of strictly positive components of the mixture model is 
deduced from Theorem 2 or from a test presented in Pons (2009). The knowledge 
of $r_n$ makes it possible to determine which class an observation $X_i$ belongs to. A 
classification consists in mapping an observation $X_i$ into a class $k_n(X_i)$ in $\{1, \ldots, r_n\}$. 
Generally, the supports of the densities are not disjoint and a value $x$ may be assigned 
to several classes when it lies in the support of several densities; the class may then be 
estimated according to a random distribution. The estimated component $k_n$ may be 
chosen by maximum likelihood with the same penalization as previously. Let 



Pn 

K„{fi„.kfn,k',x) = Jl„,kfn,k{x) - nXl'^JTi 

;=i 

-vl n2{fn,k- fn.l)- 

l<l<rn 



The classification of a variable value A, in X is defined by the estimation of the 
indices of the classes enclosing A,- 

^„(Z,) = arg max K„{}lk,Jk„\Xi) 



if the index is uniquely defined and otherwise by a set of indices with the same 
maxima provided with probabilities. For 1 < / < r„, let 



kn{Xi) = l 



with the probability defined with respect to the dominating measure as 

P„{X e Cn.i\X = Xi) = %,Jin{Xi)Z;'Xi)- 



A random classification avoids misclassification of the observations with overlapping 
densities and the definition of classes with lower estimated probabilities 
than in the mixture density, as explained in the next proposition. Let $k(X)$ denote 
the true sub-population of an observation $X$ in the mixture model. 
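
The random allocation can be sketched as follows: an observation is assigned to class $l$ with probability proportional to the estimated weight times the estimated component density at that point. This is a simplified illustration, not the penalized nonparametric procedure of this paper: the three Gaussian components, their weights and the function names are hypothetical stand-ins for fitted quantities.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Gaussian density, used here as a stand-in for estimated component densities."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def class_probabilities(x, weights, means, sds):
    """P(class l | X = x) = mu_l f_l(x) / sum_j mu_j f_j(x)."""
    joint = weights * normal_pdf(x, means, sds)
    return joint / joint.sum()

def random_class(x, weights, means, sds, rng):
    """Draw a class index (1-based) from the class probabilities at x."""
    probs = class_probabilities(x, weights, means, sds)
    return 1 + int(rng.choice(probs.size, p=probs))

# Hypothetical fitted mixture with three overlapping Gaussian components.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([0.0, 2.0, 5.0])
sds = np.array([1.0, 1.0, 1.0])

rng = np.random.default_rng(3)
probs_at_1 = class_probabilities(1.0, weights, means, sds)
label = random_class(1.0, weights, means, sds, rng)
```

At $x = 1$ the first two component densities coincide, so the class probabilities are in the ratio of the weights, 5 to 3; a deterministic rule would always choose the first class there, whereas the random rule allocates in proportion to these probabilities.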
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Under conditions of Sect. 2 and if p„ = 

- fnj) = 0( X! - fnA 



Kl<r 



Kl<r 



therefore the main term of Kn{jln,k fn,k \ x) is ^„,kfn,kix). 
Proposition 1. For every X[ in X and I t« { 1 , . . . , r„}. 






Po 



ix,)=i) 



i = l 



if Pn = o(n) and?x{kn{Xi) ^ k{Xi)} = o(l). 

Proposition 2. Ifr„ = o{n‘^iA> then for every m > 1 

kn p 

C ^{max Wfi - fj r^K / i f, - f,r dpi + oin-E)} 



l=\ 



xmin{f g AXifi - Xjfjf dx + o(l)}. 
Jrl J 

Nonparametric Fine Tuning of Mixtures: 
Application to Non-Life Insurance Claims 
Distribution Estimation 



Laure Sardet and Valentin Patilea 



Abstract When pricing a specific insurance premium, an actuary needs to evaluate the 
claims cost distribution for the warranty. Traditional actuarial methods use parametric 
specifications to model the claims distribution, such as lognormal, Weibull and Pareto 
laws. Mixtures of such distributions improve the flexibility of the parametric 
approach and seem to be quite well adapted to capture the skewness, the 
long tails as well as the unobserved heterogeneity among the claims. In this paper, 
instead of looking for a finely tuned mixture with many components, we choose a 
parsimonious mixture modeling, typically a two or three-component mixture. Next, 
we use the mixture cumulative distribution function (CDF) to transform data into the 
unit interval where we apply a beta-kernel smoothing procedure. A bandwidth rule 
adapted to our methodology is proposed. Finally, the beta-kernel density estimate 
is back-transformed to recover an estimate of the original claims density. The beta- 
kernel smoothing provides an automatic fine-tuning of the parsimonious mixture 
and thus avoids inference in more complex mixture models with many parameters. 
We investigate the empirical performance of the new method in the estimation of the 
quantiles with simulated nonnegative data and the quantiles of the individual claims 
distribution in a non-life insurance application. 

Keywords Beta kernel · Data transformation · Mixture model · Non-life insurance. 



1 Introduction 

Setting an insurance premium requires evaluating the amount of future claim payments 
for a defined period. It has been commonly agreed that the elements that need 
to be evaluated in order to determine insurance premiums are severity, frequency of 
claims and future trends. In practice an insurance premium in a non-life contract is 
based on the combination of both cost and frequency distributions and includes a 
margin to account for risk volatility. 
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Here we focus on claim costs distribution estimation for non-life products. The 
main difference between life and non-life insurance products is that in the hrst case 
the claim costs are hxed by the contract while in the non-life case the costs are 
random. Moreover, previous experience indicates that the distribution of non-life 
insurance claims is strongly skewed due to few high costs. 

Insurance is an increasingly competitive market, so insurers need to estimate their
risk robustly and accurately to be able to offer lower prices with acceptable
potential losses. Additionally, the European Commission (EC) is currently working on a
regulatory project (Solvency 2) which shall define the major guidelines to be followed
by all European insurance companies in terms of capital requirement. The final EC
directive (to be applied starting 2012) will set principles that companies should use
to estimate their capital requirement. In this directive, the value-at-risk (VaR) of
the global risk will be recommended to estimate the minimal capital requirement. For
all these reasons, insurance companies are focusing on the upper quantiles of their
cost distributions. In this paper we consider the individual claim distribution, which
is the first level of estimation in premium calculations.

In many applications, statistical modeling based on classical (parametric) severity
distributions like lognormal, gamma or Pareto laws is inadequate. Classical kernel
smoothing, as presented in Silverman (1994), could be used to derive better claims
distribution estimates, see Bolance, Guillen, and Nielsen (2003). However, classical
kernel smoothing based on symmetric kernels may be of limited value with data in
[0, ∞), and therefore asymmetric kernel-based estimators may be recommended, see
Bouezmarni and Scaillet (2005). Below, we propose a related nonparametric methodology
starting from mixtures of simple distributions.

First, let us recall that mixing a finite number of simple distributions is an
alternative way to increase the accuracy of classical parametric models. For
illustration, we considered a real data set of 5,635 extended warranty claims observed
on a non-life insurance portfolio. See Sect. 4 for a more detailed description of the
data. In Fig. 1 we present the probability-probability (P-P) plots of the lognormal
(left picture) and the two-component lognormal mixture (right picture) obtained with
this data set. The parameters of the simple lognormal fit as well as the five
parameters of the mixture were estimated by maximum likelihood. The estimated
lognormal distribution drastically fails to fit the data: it overestimates the left
tail of the claim distribution and underestimates its right tail. This could be
explained by the coexistence of standard and high claims, which a single lognormal
distribution seems unable to capture. The mixture accounts for heterogeneity and thus
allows for a much better fit, but the model still seems misspecified and a more
complex mixture involving more components should be considered. However, it is well
known that in general the statistical inference of mixtures with many components can
be a difficult task because, on the one hand, more parameters have to be estimated
and, on the other hand, a statistical test for the number of components could be
necessary (e.g., McLachlan & Peel, 2000; Titterington, Smith, & Makov, 1985).

Fig. 1 Probability-probability (P-P) plots of the lognormal (left) and two-component mixture of
lognormal distributions (right)



A main attraction of finite mixture modeling is the possibility to identify and
estimate a finite number of latent classes and thus to explain and interpret the
observed heterogeneity. However, in the applications we have in mind, one only looks
for an accurate approximation of the data distribution. For this reason, instead of
proceeding further with more complex mixture modeling, we propose to fit a
parsimonious mixture with a small number of components. Next, the mixture cumulative
distribution function (CDF) is used to transform the data onto the unit interval.
Then, a beta-kernel based smoothing procedure (see Chen, 1999) is used to estimate the
density of the transformed data, from which the estimate of the density of the
original data is easily derived by a change of variables formula. Finally, the
estimate of the CDF of the original data is obtained by integration of the density
estimate. In this way, the parsimonious mixture with a small number of components is
finely tuned by a nonparametric procedure which requires only one additional
parameter, the bandwidth.

A similar procedure is proposed in Gustafsson, Hagmann, Nielsen, and Scaillet
(2009) (see also Charpentier & Oulidi, 2009), where the data are transformed by a
generalized Champernowne CDF estimated from the data. Let us point out that an
important feature the data transformation should have is to produce transformed
observations well spread over the whole unit interval. If the transformed data are
concentrated on some proper subinterval of [0, 1], bandwidth rules based on global
criteria like the ones considered in Gustafsson et al. (2009) and Chen (1999) will
probably fail to yield an accurate estimate of the transformed data density. We argue
that, contrary to the families of transformations defined by a fixed small number
of parameters, mixture CDFs provide a flexible family of transformations that can
be adapted to any application by tuning the number of components. Our empirical
experience with non-life claims indicates that quite often it suffices to start
from a two- or three-component mixture of lognormals, that is from mixture models
with five or seven parameters. However, more complex transformations might be
necessary in other applications, and mixture-based transformations may be adapted
to provide the required flexibility.

The paper is organized as follows. In Sect. 2 we introduce our mixture-based family
of transformations, which map the original data to the unit interval. Given a
nonparametric estimate of the transformed data density, the density estimator of
the original data is obtained by a simple change of variables formula. In Sect. 3
we recall the beta kernel estimators introduced by Chen (1999) and we propose a new
bandwidth rule. In Sect. 4 we apply our nonparametric methodology to simulated data
and to real data on non-life insurance claims.



2 Mixture-Based Data Transformations 

The classical finite mixture model is a set of finite convex combinations of simple
parametric distributions. The intuitive interpretation of such a model is the presence
of different latent (unobserved) subpopulations with specific behaviors. Each
component of the model corresponds to a specific subpopulation. Mixtures can be used
as an unsupervised clustering method or to provide a flexible estimation procedure
for the marginal distribution of the data. See Titterington et al. (1985) or McLachlan
and Peel (2000) for comprehensive reviews of mixture models.

In a more formal description, let $\{G(\cdot;\theta) : \theta \in \Theta\}$ be a parametric
family of CDFs on the real line. A random variable $X$ has a finite mixture distribution
of $k$ components from this parametric family if the CDF $G(x) = P(X \le x)$ of $X$ can be
written as

$$G(x) = \pi_1 G(x;\theta_1) + \cdots + \pi_k G(x;\theta_k), \quad x \in \mathbb{R}, \quad \pi_i > 0, \quad \sum_{i=1}^{k} \pi_i = 1,$$

where $G(\cdot;\theta_i)$ is the CDF of the $i$th component and $\pi_i$ the corresponding
mixture weight. By definition, the mixture is identifiable if for any

$$\tilde{G}(x) = \tilde{\pi}_1 G(x;\tilde{\theta}_1) + \cdots + \tilde{\pi}_{\tilde{k}} G(x;\tilde{\theta}_{\tilde{k}}), \quad x \in \mathbb{R}, \quad \tilde{\pi}_i > 0, \quad \sum_{i=1}^{\tilde{k}} \tilde{\pi}_i = 1,$$

the equality $G(\cdot) = \tilde{G}(\cdot)$ implies $k = \tilde{k}$ and
$(\pi_1,\theta_1), \ldots, (\pi_k,\theta_k)$ are a permutation of
$(\tilde{\pi}_1,\tilde{\theta}_1), \ldots, (\tilde{\pi}_k,\tilde{\theta}_k)$.

Here we focus on mixtures of two-parameter lognormal distributions,
which represent important models for claims distributions (e.g., Sarabia, Castillo,
Gomez-Deniz, & Vazquez-Polo, 2005 and the references therein). We say that a positive
random variable $Z$ follows a lognormal $LN(\mu,\sigma^2)$ distribution, with
$\mu \in \mathbb{R}$ and $\sigma > 0$, if the random variable $\ln(Z)$ is normal with mean
$\mu$ and variance $\sigma^2$, which means that the CDF of $Z$ is

$$G(x;\mu,\sigma^2) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right), \quad x > 0,$$

where $\Phi(\cdot)$ is the standard normal CDF. Finite mixtures of lognormal
distributions are identifiable (e.g., Atienza, Garcia-Heras, & Munoz-Pichardo, 2006).
Given an independent identically distributed sample $X_1, X_2, \ldots, X_n$ of $X$, the
parameters of the mixture can be estimated by maximum likelihood. For maximizing the
likelihood, the popular EM algorithm can be used (e.g., Titterington et al., 1985).
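The mixture CDF and its maximum likelihood fit can be sketched in a few lines. The following is a minimal illustration in Python (the paper itself uses SAS/IML), not the authors' implementation: the function names, the starting values taken from the two halves of the sorted log-claims, the fixed number of iterations and the helper `simulate_two_lognormals` are all assumptions of this sketch. It relies on the fact that a lognormal mixture for the claims is a normal mixture for the log-claims.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def lognormal_cdf(x, mu, sigma):
    """CDF of LN(mu, sigma^2): Phi((ln x - mu) / sigma) for x > 0, and 0 otherwise."""
    if x <= 0.0:
        return 0.0
    return PHI.cdf((math.log(x) - mu) / sigma)


def mixture_cdf(x, weights, mus, sigmas):
    """Finite mixture CDF: the weighted sum of the component CDFs."""
    return sum(w * lognormal_cdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))


def normal_pdf(v, mu, sigma):
    """Density of N(mu, sigma^2) at v."""
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_lognormals(claims, n_iter=200):
    """EM for a two-component lognormal mixture, run on the log-claims.

    Returns (weights, mus, sigmas), with sigmas as standard deviations.
    """
    z = sorted(math.log(x) for x in claims)
    n = len(z)
    halves = (z[: n // 2], z[n // 2:])  # crude starting split (assumption)
    weights = [0.5, 0.5]
    mus = [sum(h) / len(h) for h in halves]
    sigmas = [max(1e-3, math.sqrt(sum((v - m) ** 2 for v in h) / len(h)))
              for h, m in zip(halves, mus)]
    for _ in range(n_iter):
        # E-step: posterior probability of component 0 for each observation.
        resp = []
        for v in z:
            d0 = weights[0] * normal_pdf(v, mus[0], sigmas[0])
            d1 = weights[1] * normal_pdf(v, mus[1], sigmas[1])
            resp.append(d0 / (d0 + d1) if d0 + d1 > 0.0 else 0.5)
        # M-step: weighted proportions, means and standard deviations.
        for j, r in enumerate((resp, [1.0 - p for p in resp])):
            nj = max(sum(r), 1e-12)
            weights[j] = nj / n
            mus[j] = sum(p * v for p, v in zip(r, z)) / nj
            var = sum(p * (v - mus[j]) ** 2 for p, v in zip(r, z)) / nj
            sigmas[j] = max(1e-3, math.sqrt(var))
    return weights, mus, sigmas


def simulate_two_lognormals(n, seed):
    """Hypothetical test data: 0.7 LN(5, 0.5^2) + 0.3 LN(7, 0.4^2)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(5.0, 0.5) if rng.random() < 0.7
            else rng.lognormvariate(7.0, 0.4) for _ in range(n)]
```

In practice one would add a convergence check on the log-likelihood and several starting points, since EM only finds a local maximum.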

Let $\hat{G}(\cdot)$ be the parsimonious (small number of components) lognormal mixture
CDF estimated with a sample of claims $X_1, X_2, \ldots, X_n$. Next, build the sample of
transformed observations

$$Y_i = \hat{G}(X_i), \quad 1 \le i \le n,$$

which lie in the unit interval [0, 1]. Data transformation is frequently used in
nonparametric statistics (e.g., Devroye & Gyorfi, 1985; Marron & Ruppert, 1994; Wand,
Marron, & Ruppert, 1991). The transformations we consider are based on mixture
models and could be made arbitrarily flexible by increasing the number of mixture
components. However, in practice, one has to find a trade-off which avoids
inference in complex mixture models and meanwhile allows for sufficiently flexible
transformations. For simplicity, in the empirical experiments presented below
we consider mixtures of two lognormals, that is, the mixture model parameters are
$\theta_1 = (\mu_1,\sigma_1^2)$, $\theta_2 = (\mu_2,\sigma_2^2)$ and $\pi_1$. Such
parsimonious mixtures yield sufficiently flexible transformations and transformed
data well spread over the whole unit interval for the applications we consider.

Given $\hat{f}(\cdot)$, a nonparametric estimate of the density of the transformed data,
we propose to estimate the density of the original claims by

$$\hat{g}(x) = \hat{f}(\hat{G}(x))\,\hat{G}'(x), \quad x > 0, \qquad (1)$$

where $\hat{G}'(\cdot)$ is the estimated two-component lognormal mixture density. The
quantity $\hat{f}(\hat{G}(\cdot))$ represents the nonparametric fine-tuning factor of the
parsimonious mixture density $\hat{G}'(\cdot)$. In practice, given the estimated mixture
parameters, $\hat{G}(\cdot)$ and $\hat{G}'(\cdot)$ can be easily computed using the density
and the CDF of the standard normal distribution implemented in any standard software.
For the estimation of the transformed data density we use a beta-kernel approach
described in the following section.
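As a small illustration of the change of variables formula (1), the sketch below evaluates the back-transformed density for a given estimate of the transformed-data density. It is written in Python with hypothetical function names and hypothetical mixture parameters; `f_hat` stands for any density estimate on the unit interval and is passed in as a plain function.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal: cdf() and pdf()


def mix_cdf(x, w, mus, sigmas):
    """Lognormal mixture CDF G(x) for x > 0."""
    return sum(wi * PHI.cdf((math.log(x) - m) / s) for wi, m, s in zip(w, mus, sigmas))


def mix_pdf(x, w, mus, sigmas):
    """Mixture density G'(x); each lognormal density is phi(z) / (sigma * x)."""
    return sum(wi * PHI.pdf((math.log(x) - m) / s) / (s * x)
               for wi, m, s in zip(w, mus, sigmas))


def back_transform(f_hat, x, w, mus, sigmas):
    """Formula (1): g(x) = f_hat(G(x)) * G'(x), for x > 0."""
    return f_hat(mix_cdf(x, w, mus, sigmas)) * mix_pdf(x, w, mus, sigmas)
```

If `f_hat` is identically one (the transformed data look uniform), the formula returns the mixture density unchanged, which is the sense in which the nonparametric factor only fine-tunes the parametric fit.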



3 Beta Kernel Density Estimation 

As a remedy to the boundary bias of the classical symmetric kernel smoothing, Chen
(1999) proposed two types of beta-kernel estimators. Given a sample
$Y_1, \ldots, Y_n \in [0, 1]$, Chen's first nonparametric estimator of the density
$f(\cdot)$ of the $Y_i$ is defined as

$$\hat{f}_1(y) = n^{-1} \sum_{i=1}^{n} K_{y/b+1,\,(1-y)/b+1}(Y_i),$$

where $b$ is a smoothing parameter and $K_{p,q}(\cdot)$ is the density of a beta random
variable with parameters $p, q > 0$, that is

$$K_{p,q}(t) = \frac{t^{p-1}(1-t)^{q-1}}{\int_0^1 u^{p-1}(1-u)^{q-1}\,du}, \quad t \in (0, 1).$$



Let us recall the following two properties of the beta kernel estimators pointed out 
by Chen: (1) these estimators are, in some sense, adaptive density estimators (the 
amount of smoothing being changed according to the position where the density 
estimation is made without explicitly changing the bandwidth); and (2) the support 
of the estimator matches the support of the data density. 
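Chen's first estimator averages beta densities whose shape depends on the evaluation point. A minimal Python sketch follows; the function names are hypothetical, the normalizing integral is computed through the log-gamma function, and observations lying exactly at 0 or 1 are given zero weight, which is a simplification of this sketch.

```python
import math


def beta_pdf(t, p, q):
    """Density of a Beta(p, q) variable at t in (0, 1); returns 0 outside (0, 1)."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_norm + (p - 1.0) * math.log(t) + (q - 1.0) * math.log1p(-t))


def chen_f1(y, sample, b):
    """First beta-kernel estimate at y: mean of Beta(y/b + 1, (1 - y)/b + 1) densities."""
    p, q = y / b + 1.0, (1.0 - y) / b + 1.0
    return sum(beta_pdf(yi, p, q) for yi in sample) / len(sample)
```

Note that the kernel is evaluated at the observations while its parameters depend on `y`, so the amount of smoothing changes with the position even though `b` is fixed.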

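In order to reduce the bias of $\hat{f}_1$, Chen proposed a second beta kernel estimator

$$\hat{f}_2(y) = n^{-1} \sum_{i=1}^{n} K^{*}_{y,b}(Y_i),$$

where $K^{*}_{y,b}$ are boundary modified beta kernels defined as

$$K^{*}_{y,b}(t) = \begin{cases}
K_{\rho(y,b),\,(1-y)/b}(t) & \text{if } y \in [0, 2b), \\
K_{y/b,\,(1-y)/b}(t) & \text{if } y \in [2b, 1-2b], \\
K_{y/b,\,\rho(1-y,b)}(t) & \text{if } y \in (1-2b, 1],
\end{cases}$$

where $\rho(y,b) = 2b^2 + 2.5 - \sqrt{4b^4 + 6b^2 + 2.25 - y^2 - y/b}$.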
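The boundary modification can be sketched as follows, again in Python with hypothetical function names. The three cases follow the definition of the modified kernels in Chen (1999); the sketch assumes a bandwidth below 0.25 so that the interior region is not empty.

```python
import math


def beta_pdf(t, p, q):
    """Density of a Beta(p, q) variable at t in (0, 1); returns 0 outside (0, 1)."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_norm + (p - 1.0) * math.log(t) + (q - 1.0) * math.log1p(-t))


def rho(y, b):
    """Boundary correction: 2b^2 + 2.5 - sqrt(4b^4 + 6b^2 + 2.25 - y^2 - y/b)."""
    return 2.0 * b * b + 2.5 - math.sqrt(4.0 * b ** 4 + 6.0 * b * b + 2.25 - y * y - y / b)


def chen_f2(y, sample, b):
    """Second beta-kernel estimate at y, with modified kernels near 0 and 1."""
    if y < 2.0 * b:                      # left boundary region [0, 2b)
        p, q = rho(y, b), (1.0 - y) / b
    elif y <= 1.0 - 2.0 * b:             # interior [2b, 1 - 2b]
        p, q = y / b, (1.0 - y) / b
    else:                                # right boundary region (1 - 2b, 1]
        p, q = y / b, rho(1.0 - y, b)
    return sum(beta_pdf(yi, p, q) for yi in sample) / len(sample)
```

A quick check of the definition: the correction equals 1 at `y = 0` and 2 at `y = 2b`, so the kernel parameters join continuously with `y / b` at the edge of the interior region.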

The biases of the two beta kernel estimators, as presented in Chen (1999), are

$$\mathrm{Bias}\{\hat{f}_1(y)\} = \left\{(1-2y)f'(y) + \tfrac{1}{2}\,y(1-y)f''(y)\right\} b + o(b)$$

and

$$\mathrm{Bias}\{\hat{f}_2(y)\} = \begin{cases}
\xi(y)\,b\,f'(y) + o(b) & \text{if } y \in [0, 2b), \\
\tfrac{1}{2}\,y(1-y)f''(y)\,b + O(b^2) & \text{if } y \in [2b, 1-2b], \\
-\xi(1-y)\,b\,f'(y) + o(b) & \text{if } y \in (1-2b, 1],
\end{cases}$$

where $\xi(y) = (1-y)\{\rho(y,b) - y/b\}/\{1 + b\rho(y,b) - y\}$. Meanwhile, for any fixed
$y \in (0, 1)$, the variance of $\hat{f}_1(y)$ and $\hat{f}_2(y)$ is

$$\frac{1}{2\sqrt{\pi}}\,\frac{n^{-1} b^{-1/2}}{\{y(1-y)\}^{1/2}}\,\{f(y) + O(n^{-1})\}.$$

See Sect. 3 in Chen (1999) for the behavior of the variance when $y/b$ and $(1-y)/b$
converge to some finite constant.

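Given the bias and variance formulae and following the usual bias-variance trade-off
criterion, we propose the following global bandwidth for the estimator $\hat{f}_1$:

$$b_1^{*} = \frac{\left[\frac{1}{2\sqrt{\pi}} \int_{\varepsilon}^{1-\varepsilon} \{y(1-y)\}^{-1/2} f(y)\,dy\right]^{2/5}}{4^{2/5} \left[\int_{\varepsilon}^{1-\varepsilon} \left\{(1-2y)f'(y) + \tfrac{1}{2}\,y(1-y)f''(y)\right\}^2 dy\right]^{2/5}}\; n^{-2/5},$$

where $\varepsilon$ is some small positive constant. On similar grounds, for the
estimator $\hat{f}_2$ we propose the global bandwidth

$$b_2^{*} = \frac{\left[\frac{1}{2\sqrt{\pi}} \int_{\varepsilon}^{1-\varepsilon} \{y(1-y)\}^{-1/2} f(y)\,dy\right]^{2/5}}{\left[\int_{\varepsilon}^{1-\varepsilon} \left\{y(1-y)f''(y)\right\}^2 dy\right]^{2/5}}\; n^{-2/5}.$$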

When $\varepsilon = 0$ in the definitions of $b_1^{*}$ and $b_2^{*}$ above, we recover the
optimal bandwidths proposed by Chen (see Chen, 1999, p. 137), which were obtained by
minimizing the leading terms of the MISE for $\hat{f}_1$ and $\hat{f}_2$, respectively.
Our bandwidths could be derived as optimal bandwidths for the mean integrated squared
errors of $\hat{f}_1$ and $\hat{f}_2$, respectively, on the interval
$[\varepsilon, 1-\varepsilon]$. The rationale for reducing the integration interval is to
avoid the intervals very close to the boundary, where factors like $y^{a}$ and
$(1-y)^{b}$ appearing in the integrands with $a, b < 0$ may produce numerical
instability. It may even happen that the integrals over the whole interval [0, 1] do
not exist, that is, the optimal bandwidths of Chen may not be defined while our
bandwidths are always well defined. In our empirical experiments with transformed
claims data (real and simulated) we take $\varepsilon = 0.01$, and this produces
satisfactory results. A related idea of reducing the integration interval when
defining the global criterion for choosing the bandwidth was proposed
in Bouezmarni and Scaillet (2005).
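The first bandwidth can be recovered from the leading bias and variance terms by a standard calculation. The sketch below keeps only those leading terms; the symbols $B_1$ and $V$ are shorthand introduced here for readability and are not notation from the paper.

```latex
% Shorthand used in this sketch only:
%   B_1(y) = (1-2y) f'(y) + (1/2) y(1-y) f''(y)
%   V = (2\sqrt{\pi})^{-1} \int_\varepsilon^{1-\varepsilon} \{y(1-y)\}^{-1/2} f(y)\,dy
\begin{align*}
\mathrm{MISE}_\varepsilon(b)
  &\approx b^2 \int_\varepsilon^{1-\varepsilon} B_1(y)^2\,dy + n^{-1} b^{-1/2}\, V, \\
\frac{d}{db}\,\mathrm{MISE}_\varepsilon(b)
  &\approx 2b \int_\varepsilon^{1-\varepsilon} B_1(y)^2\,dy
     - \tfrac{1}{2}\, n^{-1} b^{-3/2}\, V = 0
  \iff b^{5/2} = \frac{V}{4n \int_\varepsilon^{1-\varepsilon} B_1(y)^2\,dy}, \\
b_1^{*}
  &= \left[\frac{V}{4 \int_\varepsilon^{1-\varepsilon} B_1(y)^2\,dy}\right]^{2/5} n^{-2/5}.
\end{align*}
```

For the second estimator the same steps apply with the interior bias factor $\tfrac{1}{2}\,y(1-y)f''(y)$ in place of $B_1(y)$, so the factor 4 cancels against $(1/2)^2$ and only $\int \{y(1-y)f''(y)\}^2 dy$ remains in the denominator. This presumes that the boundary regions, whose width is of order $b$, contribute only lower-order terms.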

In order to compute our bandwidths $b_1^{*}$ and $b_2^{*}$ we still need preliminary
estimators of the true density $f(\cdot)$ and its first two derivatives. Following the
spirit of the rule of thumb proposed by Silverman for classical Gaussian kernel
smoothing (see Silverman, 1994), we consider a simple beta density fit and the
corresponding derivatives as preliminary estimates of $f(\cdot)$, $f'(\cdot)$ and
$f''(\cdot)$. To estimate the two parameters, say $\alpha, \beta$, of the simple beta
fit, we use the method of moments with the first two moments of the beta distribution.
This produces the following simple estimators

$$\hat{\beta} = \left\{\frac{m_1(1-m_1)}{m_2} - 1\right\}(1-m_1), \qquad \hat{\alpha} = \hat{\beta}\,\frac{m_1}{1-m_1},$$

where $m_1$ and $m_2$ are the empirical mean and variance, respectively. The beta density
with parameters $\hat{\alpha}$ and $\hat{\beta}$ and its first two derivatives are then
plugged into the formulae of $b_1^{*}$ and $b_2^{*}$ in place of the true $f(\cdot)$,
$f'(\cdot)$ and $f''(\cdot)$. Finally, the integrals on $[\varepsilon, 1-\varepsilon]$ are
computed by numerical integration with the trapezoidal rule.
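The whole bandwidth rule (moment fit, reference derivatives, truncated integrals) fits in a short sketch. The Python code below uses hypothetical function names and a hypothetical number of trapezoidal steps; the derivatives of the beta density are obtained from the derivative of its logarithm.

```python
import math


def beta_moment_fit(sample):
    """Method-of-moments Beta(alpha, beta) fit from the empirical mean and variance."""
    n = len(sample)
    m1 = sum(sample) / n
    m2 = sum((v - m1) ** 2 for v in sample) / n
    common = m1 * (1.0 - m1) / m2 - 1.0
    beta = common * (1.0 - m1)
    alpha = beta * m1 / (1.0 - m1)
    return alpha, beta


def beta_derivs(y, alpha, beta):
    """Reference density f(y) and its first two derivatives, for 0 < y < 1."""
    log_f = (math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
             + (alpha - 1.0) * math.log(y) + (beta - 1.0) * math.log1p(-y))
    f = math.exp(log_f)
    s = (alpha - 1.0) / y - (beta - 1.0) / (1.0 - y)              # (log f)'
    ds = -(alpha - 1.0) / y ** 2 - (beta - 1.0) / (1.0 - y) ** 2  # (log f)''
    return f, f * s, f * (s * s + ds)


def trapezoid(func, lo, hi, steps=2000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    total += sum(func(lo + k * h) for k in range(1, steps))
    return total * h


def bandwidths(alpha, beta, n, eps=0.01):
    """Plug-in bandwidths (b1, b2) with integrals restricted to [eps, 1 - eps]."""
    def var_term(y):
        return beta_derivs(y, alpha, beta)[0] / math.sqrt(y * (1.0 - y))

    def bias1_sq(y):
        _, d1, d2 = beta_derivs(y, alpha, beta)
        return ((1.0 - 2.0 * y) * d1 + 0.5 * y * (1.0 - y) * d2) ** 2

    def bias2_sq(y):
        return (y * (1.0 - y) * beta_derivs(y, alpha, beta)[2]) ** 2

    v = trapezoid(var_term, eps, 1.0 - eps) / (2.0 * math.sqrt(math.pi))
    b1 = (v / (4.0 * trapezoid(bias1_sq, eps, 1.0 - eps))) ** 0.4 * n ** -0.4
    b2 = (v / trapezoid(bias2_sq, eps, 1.0 - eps)) ** 0.4 * n ** -0.4
    return b1, b2
```

One caveat of any such reference rule: if the fitted beta density is close to uniform, its derivatives are close to zero and the resulting bandwidths become very large, so a cap on the bandwidth may be sensible in practice.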



4 Application to Non-Life Insurance Data 

Here we investigate the empirical performance of the new smoothing method with
real and simulated data. For all the empirical investigations we use SAS/IML
routines. In the simulation experiment, we consider 1,000 samples of n = 1,000
independent observations from a three-component lognormal mixture

$$\pi_1 LN(\mu_1,\sigma_1^2) + \pi_2 LN(\mu_2,\sigma_2^2) + \pi_3 LN(\mu_3,\sigma_3^2),$$

with $\pi_1 = 0.59$, $\pi_2 = 0.35$, $\mu_1 = 5.37$, $\mu_2 = 5.31$, $\mu_3 = 4.08$,
$\sigma_1 = 0.92$, $\sigma_2 = 0.36$ and $\sigma_3 = 0.35$. The values of the parameters
were chosen to be similar to values obtained when fitting lognormal mixtures to real
data sets of extended warranty claims. For each simulated sample we estimate a mixture
of two lognormal distributions using the EM algorithm. The initial points for the five
parameters to be estimated are generated randomly over a large range of the parameter
space (the mixture weight in (0, 1), the two location parameters and the two scale
parameters). Only the values of the parameters yielding a two-component mixture with
mean and variance similar to the corresponding moments computed in the simulated
sample were retained as initial points for the EM algorithm.
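The simulation design can be reproduced in outline as follows. This Python sketch is only an illustration: the third weight is taken as 0.06 because the weights must sum to one, the listed scale values are read as standard deviations of the log-claims, and the function names and the bisection bracket are assumptions.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist()

# Parameters quoted in the text; the third weight is implied by the sum-to-one constraint.
WEIGHTS = (0.59, 0.35, 0.06)
MUS = (5.37, 5.31, 4.08)
SIGMAS = (0.92, 0.36, 0.35)  # read as standard deviations (assumption)


def draw_claim(rng):
    """One draw from the three-component lognormal mixture."""
    i = rng.choices(range(3), weights=WEIGHTS)[0]
    return rng.lognormvariate(MUS[i], SIGMAS[i])


def simulate_sample(n, seed):
    """A simulated sample of n claims with a reproducible seed."""
    rng = random.Random(seed)
    return [draw_claim(rng) for _ in range(n)]


def mixture_cdf(x):
    """CDF of the simulation mixture at x > 0."""
    return sum(w * PHI.cdf((math.log(x) - m) / s)
               for w, m, s in zip(WEIGHTS, MUS, SIGMAS))


def true_quantile(level, lo=1e-6, hi=1e7, tol=1e-9):
    """Invert the mixture CDF by bisection; the bracket [lo, hi] is an assumption."""
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `true_quantile(0.99)` gives the reference value against which the estimated upper quantiles of the simulated samples can be compared.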

We transform the data using both the two-component mixture and the Champernowne
CDFs. The Champernowne CDF has the form

$$F_{\alpha,M,c}(y) = \frac{(y+c)^{\alpha} - c^{\alpha}}{(y+c)^{\alpha} + (M+c)^{\alpha} - 2c^{\alpha}} \quad \text{if } y > 0,$$

and $F_{\alpha,M,c}(y) = 0$ otherwise, with parameters $\alpha > 0$, $M > 0$ and $c > 0$.
To estimate the three parameters of the Champernowne CDF, we first compute the
empirical median $\hat{M}$ and then choose the parameters $\alpha$ and $c$ by maximizing
the (profile) likelihood obtained after replacing $M$ by $\hat{M}$ (see Gustafsson et
al., 2009 and the references therein).
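For comparison, the Champernowne transformation and the profile log-likelihood can be sketched as below. The Python function names are hypothetical, the density is obtained by differentiating the CDF given above, and the maximization over the two remaining parameters (for instance by a grid search) is left out.

```python
import math
from statistics import median


def champernowne_cdf(y, alpha, m, c):
    """Generalized Champernowne CDF; equals 0 at y = 0 and 1/2 at y = m."""
    if y < 0.0:
        return 0.0
    num = (y + c) ** alpha - c ** alpha
    den = (y + c) ** alpha + (m + c) ** alpha - 2.0 * c ** alpha
    return num / den


def champernowne_pdf(y, alpha, m, c):
    """Derivative of the CDF with respect to y, for y >= 0."""
    if y < 0.0:
        return 0.0
    den = (y + c) ** alpha + (m + c) ** alpha - 2.0 * c ** alpha
    return alpha * (y + c) ** (alpha - 1.0) * ((m + c) ** alpha - c ** alpha) / den ** 2


def profile_loglik(sample, alpha, c):
    """Log-likelihood in (alpha, c) with M fixed at the empirical median."""
    m_hat = median(sample)
    return sum(math.log(champernowne_pdf(y, alpha, m_hat, c)) for y in sample)
```

Because the CDF equals one half at `m` by construction, fixing `m` at the empirical median matches the median of the fitted distribution to that of the data.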

With the transformed data at hand, we apply the two beta-kernel estimators introduced
by Chen with the bandwidth rules $b_1^{*}$ and $b_2^{*}$ described in Sect. 3 to estimate
the density of the transformed data. Finally, using the change of variables formula
(1) and the trapezoidal rule for numerical integration, we estimate the 99th quantile
of the original data distribution.
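The last step, turning a density estimate into a quantile by trapezoidal integration, might look as follows. This is a generic Python sketch with a hypothetical function name: the grid bounds and the number of steps are assumptions, mass below `lower` is ignored, and the quantile is located by linear interpolation inside the grid cell where the cumulated area first reaches the level.

```python
def quantile_from_density(density, level, lower, upper, steps=20000):
    """Quantile of order `level` (0 < level < 1) of `density` on [lower, upper]."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    h = (upper - lower) / steps
    cum = 0.0
    prev = density(lower)
    for k in range(1, steps + 1):
        x = lower + k * h
        cur = density(x)
        area = 0.5 * (prev + cur) * h      # trapezoid over [x - h, x]
        if cum + area >= level:
            # Interpolate linearly within the cell that crosses the level.
            return x - h + h * (level - cum) / area
        cum += area
        prev = cur
    raise ValueError("level not reached on [lower, upper]")
```

For the claims application, `density` would be the back-transformed estimate of formula (1) and `lower` a small positive number, since that estimate is only defined for positive claim amounts.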

First, we investigate the impact of our bandwidth rules based on the MISE
reduced to the interval $[\varepsilon, 1-\varepsilon]$ compared to the classical ones
proposed by Chen. The results are summarized by the box plots presented in Fig. 2. We
notice that our corrected bandwidth rule slightly improves the performance of the
quantile estimators with both beta-kernel estimators. It seems that reducing the
interval for bandwidth calculations in order to avoid numerical instability is a
satisfactory alternative to Chen's bandwidth rules.

Fig. 2 Box plots of the 99th quantile estimation obtained with two-component lognormal mixture
CDF transformation and the beta-kernel estimators $\hat{f}_1$ (left picture) and $\hat{f}_2$
(right picture): Chen's bandwidth rules (left plots) vs. $b_1^{*}$ and $b_2^{*}$ obtained with
$\varepsilon = 0.01$ (right plots). The true 99th quantile value is given by the horizontal line

Fig. 3 Box plots of the 99th quantile estimation obtained with two-component lognormal mixture
CDF transformation (left plots) and Champernowne CDF transformation (right plots), the
beta-kernel estimators $\hat{f}_1$ and $\hat{f}_2$ and bandwidths $b_1^{*}$ and $b_2^{*}$
obtained with $\varepsilon = 0.01$. The true 99th quantile corresponds to the horizontal line

Next, we compare our mixture CDF transformation method to the Champernowne
CDF transformation method proposed by Gustafsson et al. (2009) using the
same simulated data. Our bandwidth rule is applied in both cases. The results are
presented in Fig. 3. With both beta-kernel estimators, the estimates of the 99th
quantile obtained using the mixture CDF transformation are much better. The method
based on the Champernowne CDF significantly overestimates the 99th quantile.
We explain this by the higher flexibility of the mixture-based transformation, which
captures the right tail of the original data and thus avoids the transformation of
the large claims onto a very tiny subinterval of (0, 1) on which the beta-kernel
smoother is unable to provide reliable estimates.

Table 1 Transformation-based nonparametric quantile estimation with real data on non-life
insurance claims (5,635 observations)

Transformation CDF   Beta-kernel   Mean    q50     q70     q75     q95      q97
Mix2LN               Chen1         88.93   78.79   82.83   87.25   129.20   137.20
Mix2LN               Chen2         88.73   79.12   84.39   90.01   145.67   164.87
Champernowne         Chen1         88.93   79.64   89.20   95.34   149.56   182.47
Champernowne         Chen2         86.91   80.30   91.55   97.57   148.90   175.54

Finally, we apply our new smoothing method to real data. The historical data used
contain individual claim costs paid for a good protection insurance of a specific
segment from 2002 to 2005 (5,635 observations). The observations are multiplied
by an inflating factor such that the costs are as if all claims occurred at the end
of 2005. Table 1 contains the quantile estimates obtained with this real data set.
For each transformation, the results obtained with $\hat{f}_1$ or $\hat{f}_2$ become
significantly different for high quantiles, which seems normal in view of the
definition of $\hat{f}_2$. The 97th quantile estimates obtained with the Champernowne
CDF transformation are significantly larger than those obtained with the mixture CDF
transformation. The previous simulation experiment suggests that our method yields
estimates with a much smaller bias, but of course this has to be further confirmed by
extensive empirical work.
Part IV

Linguistics and Text Analysis 



Classification of Text Processing Components: 
The Tesla Role System 

Jürgen Hermes and Stephan Schwiebert



Abstract The modeling of component interactions represents a major challenge 
in designing component systems. In most cases, the components in such systems 
interact via the results they produce. This approach results in two conflicting 
requirements that have to be satisfied. On the one hand, the interfaces between the 
components are subject to exact specifications. On the other hand, however, the 
component interfaces should not be excessively restricted as this might require 
the data produced by the components to be converted into the system’s data format. 
This might pose certain difficulties if complex data types (e.g., graphs or matrices) 
have to be stored as they often require non-trivial access methods that are not sup- 
ported by a general data format. 

The approach introduced in this paper tries to overcome this dilemma by meeting 
both demands: A role system is a generic way that enables text processing compo- 
nents to produce highly specific results. The role concept described in this paper has 
been adopted by the Tesla (Text Engineering Software Laboratory) framework. 

Keywords Component framework • Text engineering • Text mining.



1 Introduction 

If a complex data analysis task can be divided into several subtasks, it is desirable to 
employ independent modules (components) to process the latter. Szyperski (1998) 
defines a software component as “a unit of composition with contractually specified 
interfaces and explicit context dependencies only. A software component can be 
deployed independently and is subject to composition by third parties.” The advan- 
tage of component architectures lies primarily in the fact that single components are 
not tailored for one specific context but can be reused in a variety of applications 
(see Veronis & Ide, 1996 for the aspect of reusability). In the field of text processing,
in particular, the use of a component architecture enables data and method exchange
between various fields of research, as text analysis is a major subject of a variety of
different sciences such as linguistics, literature, bioinformatics and musicology (if
scores are regarded as textual data).

The major design question in component architectures is the interaction between 
the components. Text processing component systems are, at least partially, sequen- 
tially organized, i.e., the components are not employed in parallel independently of 
one another to process particular tasks, but are used in series, where each component 
builds upon the data produced by the previous one. For instance, a part-of-speech 
tagger can start processing only after the input text has been segmented into tokens 
by a tokenizer. 

Therefore it is advisable to specify a component interaction interface through the
data produced by the components. This approach results in two conflicting requirements
for this interface (for a detailed discussion of the requirements see Götz &
Suhre, 2004). From the framework point of view, the interface should be as restricted
as possible to ensure straightforward component interaction and component
interchangeability. From the component point of view, however, the interface should
not be subject to rigid restrictions to guarantee meaningful results and flexible task
handling.

In this paper, we propose a role-based approach to overcome this dilemma. The 
rest of the paper is organized as follows. Section 2 outlines component interaction 
modeling in existing text processing systems. Section 3 provides a detailed discus- 
sion of the role-based approach to component interaction. Section 4 outlines the 
major advantages and limitations of the approach described above. 



2 Related Work 

Component-based architectures have been successfully applied to text processing
since the 1990s (Cunningham & Bontcheva, 2006). GATE¹ is one of the first
attempts to apply a component pipeline architecture to text processing tasks. The
information exchange between the components is realized here in a straightforward
way: All GATE components process language resources and enhance them with
certain annotations stored within an annotation graph that is defined in the ATLAS
framework (Bird et al., 1999). The annotations thus become part of language
resources and are accessible to other components.

This approach is further refined in the UIMA framework.² The exchange format
of UIMA components enables, in particular, the use of complex objects as annotations
(referred to as Types). Types and their features are defined in the so-called
Type System that can be extended by UIMA users to include custom types. UIMA
Types can contain a number of primitive data types, collections of those or other
UIMA Types, i.e., every object derived from a class contains attributes but no
methods. The objects in the Tesla role system possess both attributes and methods. This
is one of the main differences between UIMA Types and our approach, see Sect. 3.
An example of a generic UIMA annotation Type System is described in Hahn et al.
(2007).

¹ http://www.gate.ac.uk.

² http://www.research.ibm.com/UIMA/.

The above-mentioned frameworks focus primarily on the annotation of tokens,
i.e., text excerpts. Other frameworks, such as WEKA, that implement machine
learning techniques (classification, clustering, etc.) are available for modeling
relations between tokens. The WEKA framework, however, poses additional restrictions
on the input data format. In particular, an attribute-value matrix used as WEKA input
has to be generated from raw text outside the framework.



3 A Role System for Text Processing Components 

Feldman and Sanger (2006) summarize all text processing tasks under the notion
of text mining: “Text mining can be broadly defined as a knowledge-intensive process
in which a user interacts with a document collection over time by using a suite
of analysis tools.” Thus, a text mining application generates various types of output
(e.g., patterns, connections, trends) for the raw input documents. Feldman and
Sanger distinguish between preprocessing tasks (e.g., categorization or feature/term
extraction) that convert raw text documents to a collection of processed documents
necessary for further analysis and core mining operations (e.g., pattern
identification and trend analysis) that handle the actual knowledge discovery. The
frameworks described in the previous section are optimized either for preprocessing
tasks (token annotation) or core mining operations (modeling of relations between
tokens).

The need for a text mining framework that integrates complete text mining appli- 
cations is motivated by the fact that the success of text mining applications can 
be determined only empirically. By convention, experiments in natural sciences are 
documented to facilitate experiment cross-checking. This documentation comprises 
input data, experiment setup and experiment results. If complex experiments are car- 
ried out within a single environment, experiment documentation (in the case of text 
processing tasks, it is represented by raw text, component description, configuration 
and the results produced by the components) is immediately available to the user. 
Moreover, an experiment within a single framework can be modified more easily, 
e.g., selected components can be reconfigured or replaced. Thus, such a text mining 
framework facilitates the evaluation of different methods. The rest of the section is 
organized as follows. 

First, a framework that enables the usage of both token-annotating and token- 
aggregating components is presented. The section proceeds with the discussion of 
the role model and its applications to interface restriction. The section concludes 
with a discussion of a sample usage of the role model to generate structure from 
textual data. 






3.1 The Tesla Framework 

Tesla (Text Engineering Software Laboratory³) is developed to provide a virtual
scientific laboratory for researchers dealing with textual data processing. Tesla is
aimed both at software developers who work with textual data and at users who are
mainly concerned with experimenting on these data.

Tesla is designed as a multi-level system that implements physical decomposition
of graphical user interface and data processing: Several clients can connect to the
Tesla server and use its resources, while the system requirements for the clients are
comparatively low. The client is implemented as a rich client application based on
Eclipse,⁴ while the Tesla server is based on the Spring framework.⁵ Hibernate⁶ is
employed for object-relational mapping to SQL databases. The above-mentioned
tools are well-established open source projects.

Corpora, experiments and the data produced by the components are saved in
Tesla’s database. Thus, experiment results are always available within a Tesla
workgroup, which enables experiment and result replication. Tesla is the first text
engineering architecture that uses Java 5 annotations⁷ for component deployment.
Their use ranges from the integration of Hibernate annotations, which preprocess and
optimize Java classes for subsequent storage in a relational database, to the
creation of custom Java 5 annotations that define the I/O interfaces and the
configuration of single components. The integration of Tesla-specific features in the
Eclipse IDE enables developers to design Tesla annotations based on any Java class.
Tesla also possesses a user-friendly graphical editor for component configuration. The
data produced in a specific experiment can be visualized using various generic views. A
more detailed visualization of specific roles is realized through a well-documented
interface that enables a user to develop new visualizations.



3.2 The Relation Between Components and Roles 

As shown above, the Tesla framework does not pose any restrictions on the data
produced by the components. The general idea of a component framework presupposes,
however, that the results produced by one component should be accessible
to other components. Therefore, any framework must guarantee that its components
are reusable and freely exchangeable. Hamlet, Mason, and Woit (1991) point out
that “the central dilemma of software design using components is that component
developers cannot know how their components will be used and so cannot describe
component properties for an unknown, arbitrary situation”.

³ http://www.spinfo.uni-koeln.de/forschung/tesla.

⁴ http://www.eclipse.org.

⁵ http://www.springframework.org.

⁶ http://www.hibernate.org.

⁷ http://java.sun.com/j2se/1.5.0/docs/guide/language/annotations.html.

Thus, if a framework employs no global restriction on the component interface, 
the component interaction must be limited in a different way. One option is shown 
by van Gurp and Bosch (2002) who suggest assigning pre-defined roles to com- 
ponents: “By limiting the communication between two components by providing a 
smaller interface, the communication becomes more specific. Because of the smaller 
interfaces the communication also becomes easier to generalize (i.e., to apply a 
wider range of components). These two properties make the components both more 
versatile and reusable”. 

Applied to text processing components, the role concept described above is a 
mere definition of text processing tasks. The component assigned with the role 
tokenizer is expected to split the raw input text into discrete tokens. The role part- 
of-speech tagger presupposes that the component assigns part-of-speech tags to 
input tokens. An indexer combines tokens into types and counts the occurrences 
of the latter; a clusterer establishes relations between certain elements and provides 
information about these relations. 

The components in our framework have two interfaces. On the one hand, they 
consume data (raw text or the results produced by other components) through the 
input interface. On the other hand, they produce data through the output interface. 
The roles in our component-based system define only the output of the compo- 
nents as the requirements for input interfaces highly depend on concrete component 
implementations. The output interface of a component is defined through two 
elements as shown in Fig. 1. 

First, a role defines the kind of data produced by the component, i.e., the DataOb- 
jects made available by the given component. Secondly, it defines the access to 
the data produced by the component, i.e., the data access methods provided by the 
AccessAdapter. The role shown in Fig. 1 does not define the data consumed by the 
component. The latter is specified within the component itself (Fig. 2). 

A role hierarchy is employed in our system to keep the role definitions unambigu- 
ous and to facilitate the creation of new roles. The base role in the Tesla role system 
is the Annotator, as the generation of annotations is the primary task of all compo- 
nents. It is further extended by the following roles. The Tokenizer includes all roles 
that split raw text in any possible way, such as morpheme, syllable, word, sentence 
or paragraph boundary detectors. The Enhancer encompasses all roles that enhance 
tokens with various kinds of information, i.e., part-of-speech tagger, gazetteer, clas- 
sifier, etc. The Aggregator refers to the roles that establish relations between tokens, 
e.g., clusterer or string matcher. The Generator describes the roles that generate an 
output, e.g., a translator. The above mentioned main roles are extended by more spe- 
cific ones. Component creators can choose any existing role from the role hierarchy 
and thus focus on the actual task without having to implement the DataObject or the 
AccessAdapter. 
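
The role hierarchy described above can be illustrated with a small sketch. It is written in Python rather than in Tesla's Java, and all class and method names (Token, annotations, tokens, WhitespaceTokenizer, count_types) are hypothetical stand-ins, not Tesla's actual API. The point it shows is that a consumer depends only on the output interface fixed by a role, while the required input stays a concern of the concrete component.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical DataObject: one span of the raw text."""
    start: int
    end: int
    text: str


class Annotator(ABC):
    """Base role: every component produces annotations of some kind."""

    @abstractmethod
    def annotations(self):
        """AccessAdapter-style method: return the produced DataObjects."""


class Tokenizer(Annotator):
    """Role for components that split raw text into units."""

    def tokens(self):
        # Role-specific access method; here tokens are simply the annotations.
        return self.annotations()


class WhitespaceTokenizer(Tokenizer):
    """A concrete component; its required input (raw text) is its own concern."""

    def __init__(self, raw_text):
        self._raw_text = raw_text

    def annotations(self):
        result, pos = [], 0
        for word in self._raw_text.split():
            start = self._raw_text.index(word, pos)
            pos = start + len(word)
            result.append(Token(start, pos, word))
        return result


def count_types(tokenizer):
    """A consumer (an indexer) that depends only on the Tokenizer role."""
    counts = {}
    for token in tokenizer.tokens():
        counts[token.text] = counts.get(token.text, 0) + 1
    return counts
```

Any other component that fulfils the Tokenizer role could be passed to count_types without changing the consumer.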

Fig. 2 Required input is defined at component level 

The flexibility of the above mentioned role concept is manifested, in particular, in 
the integration of algorithms that use complex data structures as input or output as 
shown for the exemplary integration of the Alignment Based Learning (hereafter 
ABL) tool (van Zaanen, 1999). ABL is one of the first algorithmic adaptations of 
Harris’ distributionalism (Harris, 1951) in computational linguistics. It generates 
hypotheses about natural language through sequence alignment. While sequence 
alignment has been successfully employed in bioinformatics since the 1980s to gen- 
erate phylogenetic trees, etc., this method has been applied to linguistic tasks only 
recently and is used, for instance, in the field of machine translation (van Zaanen & 
Geertzen, 2006) and language reconstruction (Kondrak, 2002). 

ABL comprises three applications: Align, Cluster and Select. Their tasks are to 
detect constituents, to group the constituents belonging to the same category and, in 
case of overlapping constituents, to select the correct one. Each of the three com- 
ponents requires a treebank as input (and also generates one) which consists of a 
collection of sentences that may be followed by annotations describing the sentence 
constituents. 

While the annotations consumed by ABL pose no further restrictions and do 
not require a role system, the access to the annotations produced by ABL might 
be problematic in component systems which use restrictive data exchange formats, 
such as Gate’s Language Resource or UIMA’s CAS. It is not desirable to represent 
all alignments as corpus annotations. Depending on the productivity of the align- 
ment algorithm during hypotheses generation, this might pose high requirements 
on memory. Furthermore, ABL output should be accessible to other components as 
alignment is only an auxiliary application. Simply granting direct access to 
alignment data through annotations is not sufficient; the component should 
provide task-specific access instead. For instance, if alignment is employed 
for hypothesis generation about paradigmatic relations between the constituents, it 
is desirable for the alignment component to provide such information as alignment 
scores of any two constituents or to find the constituents similar to a given one. 
Overall, the problem arises that a component must access the data which cannot be 
stored as an annotation attribute but involves further processing. Therefore, an align- 
ment component should possess interfaces that enable the abstraction from concrete 
data structures and their interior (in practice rarely sufficiently documented) logic 
and provide task-specific access methods. This can be achieved through a role sys- 
tem. The interfaces AccessAdapter and DataObject defined in a role enable the 
realization of role-specific algorithms (e.g., the detection of similar constituents or 
constituent clusters) in the form of (Java) methods. The latter can be subsequently 
used by other components ignoring the details of the actual implementation. 
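
The idea of task-specific access can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class AlignmentAccess and its methods are hypothetical, and the score uses the standard library's difflib.SequenceMatcher as a stand-in for a real alignment score, not the scoring used by ABL or Tesla.

```python
from difflib import SequenceMatcher


class AlignmentAccess:
    """Hypothetical access adapter over constituents found by an aligner."""

    def __init__(self, constituents):
        # Each constituent is stored as a tuple of tokens.
        self._constituents = [tuple(c) for c in constituents]

    def alignment_score(self, a, b):
        """Similarity in [0, 1] of two token sequences (stand-in score)."""
        return SequenceMatcher(None, tuple(a), tuple(b)).ratio()

    def similar_constituents(self, query, threshold=0.5):
        """Stored constituents whose score against `query` reaches the threshold."""
        query = tuple(query)
        scored = [(self.alignment_score(query, c), c)
                  for c in self._constituents if c != query]
        return [c for score, c in sorted(scored, reverse=True)
                if score >= threshold]
```

A consuming component calls alignment_score or similar_constituents and never sees how the alignments are stored internally.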

In Tesla, the counterpart of the ABL-application Align requires two input roles: 
Tokenizer and Annotator. The Tokenizer is employed to detect sentence boundaries, 
while the Annotator defines the kind of data that is to be processed in the following 
steps. As the Annotator is the base role in Tesla, alignments can in principle 
be produced from the output of any Tesla component. Thus, it is possible to produce 
not only word-based alignments but also alignments based on syntactic categories, 
etc. Figure 3 shows a screenshot of the Tesla experiment editor in which two separate 
ABL instances are used on different data. 

The role system enables flexibility in component interchange. Here, for instance, 
ABL components can be rearranged or replaced for test purposes by alignment 
algorithms widely employed in bioinformatics research (such as BLAST; Altschul, 
Gish, Miller, Myers, & Lipman, 1990) without any concomitant change of program 
code. Only the interface defined in the given role must be implemented for the new 
algorithm. 



4 Discussion 

The role system seeks to overcome the above mentioned dilemma between the 
framework and component requirements through coupling data and data access. 
Even complex relationships between the components within an experiment can eas- 
ily be modeled in the integrated graphical editor (as shown in Fig. 3). This enables 
users to focus on their actual research without having to deal directly with the role 
system in all its complexity. As the Tesla Client is designed as a collection of plugins 
that extend Eclipse, several third party plugins, including versioning, tools for com- 
munication and collaboration, visualization, etc., can provide other helpful tools for 
Tesla developers and users. 




Fig. 3 Screenshot of the Tesla Client 










Classification of Text Processing Components: The Tesla Role System 






Still, several limitations of the role system need to be outlined. First, component 
development and the creation of new roles currently require a lot of work. While 
component development, depending on the role definition, involves the implementation 
of various methods and data structures (even if it is possible to reuse predefined 
classes), the definition of new roles presupposes an extensive conception phase 
that must consider subsequent processing of the data produced by the role. This 
effort can, however, be reduced through general framework improvements. For 
instance, component development is simplified by extending programming aids 
offered by the Eclipse IDE to include framework-specific features. Moreover, it 
should be tested whether component compatibility can be increased through the 
employment of semantic nets (such as RDF or topic maps) in combination with 
reflective programming techniques. For instance, an intermediate layer can be 
integrated into the framework to detect seemingly incompatible but in reality 
interchangeable interfaces if this can be inferred from a semantic net. Current 
developments in the field of web technology, such as remote procedure calls, 
indicate that this is technically possible. 



Nonparametric Distribution Analysis 
for Text Mining 



Alexandros Karatzoglou, Ingo Feinerer, and Kurt Hornik 



Abstract A number of new algorithms for nonparametric distribution analysis 
based on Maximum Mean Discrepancy measures have been recently introduced. 
These novel algorithms operate in Hilbert space and can be used for nonparamet- 
ric two-sample tests. Coupled with recent advances in string kernels, these methods 
extend the scope of kernel-based methods in the area of text mining. 

We review these kernel-based two-sample tests focusing on text mining where 
we will propose novel applications and present an efficient implementation in 
the kernlab package. We also present an efficient and integrated environment 
for applying modern machine learning methods to complex text mining problems 
through the combined use of the tm (for text mining) and the kernlab (for 
kernel-based learning) R packages. 

Keywords Kernel methods • R • Text mining. 



1 Introduction 

Recent advances in the field of machine learning provide an ever enhanced arsenal 
of methods that can be used for inference and analysis in the domain of text min- 
ing. Machine learning techniques are at the basis of many modern text applications 
such as document filtering and ranking. Kernel-based methods have been shown to 
perform strongly in this area, particularly in text classification with Support Vector 
Machines using either a simple “bag of words” representation (i.e., term frequencies 
with various normalizations) (Joachims, 2002), or more sophisticated approaches 
like string kernels (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002), 
or word-sequence kernels (Cancedda, Gaussier, Goutte, & Renders, 2003). 

Many advances in the area of kernel-based machine learning have not yet 
been introduced into the field of text mining. By taking advantage of the 
R (R Development Core Team, 2008) statistical programming environment and the 
kernlab (Karatzoglou, Smola, Hornik, & Zeileis, 2004) package along with the 
tm (Feinerer, Hornik, & Meyer, 2008) package for text mining, we introduce 
novel applications of a new algorithm for the two-sample test in the area of text 
mining. This test, combined with state-of-the-art kernels for text data, can be highly 
useful for authorship attribution (Holmes, 1994; Malyutov, 2006), stylometry, and 
linguistic forensics. We also introduce a fast and memory-efficient implementation 
of string kernels which allows for the application of any kernel method to very 
large text documents. 



2 Maximum Mean Discrepancy 

In Gretton, Borgwardt, Rasch, Schölkopf, and Smola (2007) a nonparametric test 
over two samples was introduced. The test is based on finding a smooth function 
that returns different values for samples drawn from different distributions. The test 
statistic that is used to compare the two samples is the difference of the means of 
a function from a function class $\mathcal{F}$. If the test statistic exceeds a certain calculated 
bound then the samples are likely to have been drawn from different distributions. 

Test Statistic 

The test statistic that is used in Gretton et al. (2007) is the Maximum Mean 
Discrepancy (MMD), which depends on the function class $\mathcal{F}$. This function class is 
chosen to be the unit ball in a Reproducing Kernel Hilbert Space (RKHS). Consider 
two samples $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ drawn independently and 
identically distributed (i.i.d.) from distributions $p$ and $q$, respectively. If $\mathcal{F}$ is a 
class of functions $f : D \to \mathbb{R}$ for a given domain $D$, then the MMD is defined as 

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \left( \mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{y \sim q}[f(y)] \right). \tag{1}$$

Selecting the unit ball in RKHS as the function class $\mathcal{F}$ leads to 

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \mathbf{E}_{x, x' \sim p}[k(x, x')] - 2\,\mathbf{E}_{x \sim p,\, y \sim q}[k(x, y)] + \mathbf{E}_{y, y' \sim q}[k(y, y')], \tag{2}$$

where the kernel $k$ implicitly defines the function class $\mathcal{F}$. Given the set 
$Z = \{z_1, \ldots, z_m\}$ of $m$ i.i.d. random variables with $z_i = (x_i, y_i)$, an empirical estimate 
of $\mathrm{MMD}^2$ is given by 

$$\mathrm{MMD}_u^2[\mathcal{F}, X, Y] = \frac{1}{m(m-1)} \sum_{i \neq j}^{m} h(z_i, z_j), \tag{3}$$

which corresponds to a one sample U-statistic (Gretton et al., 2007) with 

$$h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i).$$
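
As an illustration, the U-statistic estimate can be computed directly from its definition. The following Python sketch assumes two samples of equal size m; the function names and the simple character-bigram kernel are illustrative assumptions and are not taken from kernlab.

```python
from collections import Counter


def bigram_kernel(a, b):
    """Illustrative kernel: number of matching character bigram pairs."""
    ca = Counter(a[i:i + 2] for i in range(len(a) - 1))
    cb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    return sum(ca[s] * cb[s] for s in ca.keys() & cb.keys())


def mmd2_unbiased(xs, ys, kernel):
    """U-statistic estimate of MMD^2 for two samples of equal size m."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("need two samples of equal size m >= 2")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                # h(z_i, z_j) with z_i = (x_i, y_i)
                total += (kernel(xs[i], xs[j]) + kernel(ys[i], ys[j])
                          - kernel(xs[i], ys[j]) - kernel(xs[j], ys[i]))
    return total / (m * (m - 1))
```

For a symmetric kernel the estimate is exactly zero when both samples are identical, since every term h(z_i, z_j) then cancels.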



Bound 

The test statistic is used to distinguish between the null hypothesis $H_0$ that the 
samples come from the same distribution and the alternative hypothesis $H_1$ that the 
samples come from different distributions. This is achieved by computing 
a bound for the test statistic above which $H_0$ is rejected. This 
bound is subject to an upper bound $\alpha$ on the probability of a Type I error, i.e., the 
probability of rejecting $H_0$ although it is true. 

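As shown in Gretton et al. (2007), under $H_1$ the estimate $\mathrm{MMD}_u^2$, that is, the 
MMD of the U-statistic, converges asymptotically to a Gaussian distribution: 

$$m^{1/2} \left( \mathrm{MMD}_u^2 - \mathrm{MMD}^2[\mathcal{F}, p, q] \right) \xrightarrow{D} \mathcal{N}(0, \sigma_u^2), \tag{4}$$

where $\sigma_u^2 = 4 \left( \mathbf{E}_z\left[ (\mathbf{E}_{z'} h(z, z'))^2 \right] - \left[ \mathbf{E}_{z, z'}(h(z, z')) \right]^2 \right)$, while $\mathbf{E}_{z'} h(z, z') = 0$ holds 
under $H_0$ and $\mathrm{MMD}_u^2$ converges according to 

$$m\,\mathrm{MMD}_u^2 \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right], \tag{5}$$

where $z_l \sim \mathcal{N}(0, 2)$ i.i.d. and the $\lambda_l$ are derived from the eigenvalue problem 

$$\int \tilde{k}(x, x')\, \psi_l(x)\, dp(x) = \lambda_l \psi_l(x'), \tag{6}$$

where $\tilde{k}(x, x')$ is the centered kernel in feature space, $\psi_l(x')$ are the eigenvectors, and 
the integral is with respect to the distribution $p$. 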

One way of computing a bound for the test statistic is to approximate 
the $(1 - \alpha)$ quantile of $\mathrm{MMD}_u^2$ under $H_0$. This can be done 
asymptotically either by bootstrapping on the original data or by 
fitting Pearson curves to the first moments (Gretton et al., 2007). 
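
A resampling bound of this kind can be sketched as follows. The sketch makes several assumptions: it reshuffles the pooled data (a permutation-style variant chosen for brevity, which may differ from the bootstrap used in kernlab), the function names are hypothetical, and `statistic` is any callable taking two samples, for example an MMD estimate.

```python
import random


def resampled_threshold(xs, ys, statistic, alpha=0.05, rounds=200, seed=0):
    """Approximate the (1 - alpha) quantile of `statistic` under H0."""
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    m = len(xs)
    values = []
    for _ in range(rounds):
        rng.shuffle(pooled)
        # Under H0 the sample labels are exchangeable, so any split is valid.
        values.append(statistic(pooled[:m], pooled[m:]))
    values.sort()
    index = min(len(values) - 1, int((1 - alpha) * len(values)))
    return values[index]


def reject_h0(xs, ys, statistic, alpha=0.05, rounds=200, seed=0):
    """True if the observed statistic exceeds the resampled bound."""
    bound = resampled_threshold(xs, ys, statistic, alpha, rounds, seed)
    return statistic(xs, ys) > bound
```

The number of rounds trades running time against the accuracy of the estimated quantile.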



3 String Kernels 

One of the main advantages of kernel-based machine learning methods is the ability 
to apply such methods directly to structured data such as text documents without 
the need for a feature extraction step. Kernels used on text range from word-sequence 
kernels, which directly compare the words of two documents, to string kernels, 
which compare the substrings appearing in the documents. 

String kernels work by calculating a weighted sum over the common substrings 
of two strings, as shown in (7). Different types of kernels arise from the use of 
different weighting schemes. The generic form of a string kernel between two 
sequences of characters $x$ and $x'$ is given by the equation 

$$k(x, x') = \sum_{s \in A^*} \mathrm{num}_s(x)\, \mathrm{num}_s(x')\, \lambda_s, \tag{7}$$

where $A^*$ represents the set of all non-empty strings, $\mathrm{num}_s(x)$ is the number of 
occurrences of $s$ in $x$, and $\lambda_s$ is a weight or decay 
factor which can be chosen to be fixed for all substrings or can be set to a different 
value for each substring. This generic representation includes a large number of 
special cases, e.g., setting $\lambda_s \neq 0$ only for substrings that start and end with a 
whitespace character gives the “bag of words” kernel. In this paper we consider 
four different types of string kernels: 

• Constant (constant): All common substrings are matched and weighted equally. 

• Exponential decay (exponential): All common substrings are matched but the 
substring weight decays as the matching substring gets shorter. 

• k-spectrum (spectrum): This kernel considers only matching substrings of exactly 
length $k$, i.e., $\lambda_s = 1$ for all $|s| = k$. 

• Bounded range (boundrange): This kernel has $\lambda_s = 0$ for all $|s| > n$, that is, it 
compares all substrings of length less than or equal to a given length $n$. 
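
The effect of the weighting schemes can be made concrete with a naive sketch. It enumerates all substrings explicitly, which is quadratic in the string length and only meant for illustration; the efficient suffix-array computation is a different technique. All function names are hypothetical, and the exponential decay weighting is omitted because its exact formula is not given here.

```python
from collections import Counter


def substring_counts(text):
    """Count every non-empty substring of `text` (naive enumeration)."""
    counts = Counter()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            counts[text[start:end]] += 1
    return counts


def string_kernel(x, y, weight):
    """Sum of num_s(x) * num_s(y) * weight(s) over shared substrings s."""
    cx, cy = substring_counts(x), substring_counts(y)
    return sum(cx[s] * cy[s] * weight(s) for s in cx.keys() & cy.keys())


def constant_weight(s):
    """Constant kernel: every common substring counts equally."""
    return 1.0


def spectrum_weight(k):
    """Spectrum kernel: only substrings of exactly length k count."""
    return lambda s: 1.0 if len(s) == k else 0.0


def boundrange_weight(n):
    """Bounded range kernel: only substrings of length <= n count."""
    return lambda s: 1.0 if len(s) <= n else 0.0
```

For instance, string_kernel("abab", "ab", spectrum_weight(2)) counts only the shared substring "ab", which occurs twice in the first string and once in the second.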

String kernels can be computed by building the suffix tree of a string $x$ and 
computing the matching statistics of a string $x'$ by traversing $x'$ through the 
suffix tree of $x$. Given a suffix tree, it can be proven that the number of occurrences 
of a certain substring $y$ can be obtained from the number of nodes at the end of the 
path of $y$ in the suffix tree. Auxiliary suffix links, linking identical suffixes in the 
tree, are utilized to speed up the computations (Vishwanathan & Smola, 2003). Two main 
suffix tree operations are required to compute string kernels: a top-down traversal 
for annotation and a suffix link traversal for computing matching statistics. Both 
operations can be performed more efficiently on a suffix array. 

The enhanced suffix array (Abouelhoda, Kurtz, & Ohlebusch, 2004) of a string 
$x$ is an array of integers corresponding to the lexicographically sorted suffixes of 
$x$, with additional information stored to allow for the reproduction of almost all 
operations available on a suffix tree. Suffix arrays offer better 
memory use and locality, so most operations can be performed faster than on the 
original suffix trees (Teo & Vishwanathan, 2006). 
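
The basic idea of a suffix array, without the additional tables of the enhanced variant, can be sketched in a few lines of Python. The construction below is the naive sort-based one and the counting routine rebuilds a prefix list on every query; both are simplifications for clarity, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def suffix_array(text):
    """Start positions of all suffixes of `text` in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` by binary search over sorted suffixes."""
    if not pattern:
        return 0
    # Truncating sorted suffixes to a fixed length keeps them sorted.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)
```

All suffixes that start with a given pattern are adjacent in the sorted order, which is what makes substring counting a range query.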



4 R Infrastructure 

R (R Development Core Team, 2008) provides a unique environment for text mining 
but has until recently lacked tools that would provide the necessary infrastructure 
to handle text and compute basic text-related operations. Package tm provides 
this functionality. 



4.1 tm 

The tm (Feinerer, Hornik, & Meyer, 2008) package provides a framework for 
text mining applications in R. It offers functionality for managing text documents, 
abstracts the process of document manipulation and eases the usage of heterogeneous 
text formats in R. The package has integrated database backend support to 
minimize memory demands. Advanced metadata management is implemented 
for collections of text documents to ease the use of large document sets enriched 
with metadata. Its data structures and algorithms can be extended to fit 
custom demands, since the package is designed in a modular way to enable easy 
integration of new file formats, readers, transformations and filter operations. tm 
provides easy access to preprocessing and manipulation mechanisms such as whitespace 
removal, stemming, or conversion between file formats. Furthermore, a generic filter 
architecture is available in order to filter documents for certain criteria or to perform 
full text search. 



4.2 kernlab 

kernlab (Karatzoglou et al., 2004) is an extensible package for kernel-based 
machine learning methods in R. The package contains implementations of most 
popular kernels, including a fast suffix-array based implementation of string kernels, 
and also provides a range of kernel methods for classification, regression (Support 
Vector Machine, Relevance Vector Machine), clustering (kernel k-means, Spectral 
Clustering), ranking, and Principal Component Analysis (PCA). 



4.3 Framework for Kernel Methods on Text in R 

During the development of tm, interoperability between kernlab and tm was a 
major design goal. TextDocument collections are essentially lists of character 
vectors where each character vector contains a text document, and can be used 
directly as input data for all corresponding functions in kernlab. Functions in 
tm can be used to preprocess the text collections and perform operations such as 
stemming, part-of-speech tagging, or searches utilizing full text and meta data. 
In combination the two packages provide a unique framework in R for applying 
modern kernel-based Machine Learning methods to large text document collections. 



4.4 Kernel MMD 

A kernel-based implementation of MMD is provided in the kernlab kmmd() 
function. The implementation provides interfaces for data in matrix format, as a 
kernel matrix or as a list (e.g., a TextDocument collection). Since the algorithm 
is based on sums over terms in kernel matrices, it is implemented by computing 
these kernel matrices in a block-wise manner and adding each computed block to 
the final sum. This avoids the memory problems that might occur when holding 
the whole kernel matrix of a large data set in memory. 
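
The block-wise summation can be sketched as follows. This is an illustrative Python version under stated assumptions, not the kmmd() implementation: the function name and the block_size parameter are hypothetical, and only the sum over all kernel entries is shown.

```python
def blockwise_kernel_sum(xs, ys, kernel, block_size=64):
    """Sum k(x, y) over all pairs without holding the full kernel matrix."""
    total = 0.0
    for i in range(0, len(xs), block_size):
        x_block = xs[i:i + block_size]
        for j in range(0, len(ys), block_size):
            y_block = ys[j:j + block_size]
            # Only this block of the kernel matrix exists in memory at a time.
            block = [[kernel(x, y) for y in y_block] for x in x_block]
            total += sum(sum(row) for row in block)
    return total
```

Memory use is bounded by the square of the block size, independent of the number of documents.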



5 Experiments 

We evaluate the performance of the algorithm on text data available online. The 
experiments are performed either by drawing two samples from a single text collection 
or topic or by comparing texts from different collections or topics, all with a 
significance level of $\alpha = 0.05$. 



5.1 Data 

Our first data set is a subset of the Reuters-21578 data set (Lewis, 1997) containing 
stories collected by the Reuters news agency. The data set is publicly available and 
has been widely used in text mining research within the last decade. Our subset 
contains an excerpt of 800 documents in the category “acq” (articles on acquisitions 
and mergers) and an excerpt of 400 documents in the category “crude” (stories in 
the context of crude oil). 

The second data set is a subset of the SpamAssassin public mail corpus (http: 
//spamassassin.apache.org/publiccorpus/). It is freely available and offers authentic 
e-mail communication with classifications into normal (ham) and unsolicited (spam) 
mail. For the experiment we had 400 spam documents available. 

The third data set consists of books from the famous The Wizard of Oz series 
(freely downloadable via Project Gutenberg at http://www.gutenberg.org/), which 
have been among the most popular children’s novels of the last century. The first 
book was written by Lyman Frank Baum and first published in 1900. A 
series of Oz books followed until Baum died in 1919. After his death Ruth Plumly 
Thompson continued the series of Oz books. We had 21 books available, written by 
either of the two authors. 



5.2 Results 

The test performs very well in all our experiments when using string 
kernels, almost irrespective of the string kernel type and parameters. The tables list 
the data sets that are being tested along with the kernel and length that is used. The 
test returns TRUE when the $H_0$ hypothesis is rejected. The asymptotic bound and the 
first and third order moments are also given in the tables. In detail, for data that 
come from the same topic (and hence, we assume, from the same distribution) 
$H_0$ is not rejected (see Table 1), while for data coming from different 
topics $H_0$ is rejected by the test (see Tables 2, 3, and 4). It is 
especially interesting that the test works both in cases where the data sets are rather 
similar (e.g., both corpora are from Reuters but on a different topic, as shown in 
Table 2) and for very different text collections (e.g., the novels on the Wizard of 
Oz compared to Reuters news stories, as shown in Table 4). 

Table 1 Kernel MMD results for two identical data sets (using the Reuters “crude” corpus) under 
different string kernels 

Data Set      Type         Arg  H0 rej.  AsymBound  1st moment  3rd moment
CrudeVsCrude  Exponential  -    FALSE    0.013      0.000       0.000
CrudeVsCrude  Constant     -    FALSE    0.057      0.000       0.000
CrudeVsCrude  Spectrum     4    FALSE    0.048      0.026       0.001
CrudeVsCrude  Spectrum     6    FALSE    0.037      0.000       0.000
CrudeVsCrude  Spectrum     8    FALSE    0.053      0.010       0.000
CrudeVsCrude  Spectrum     10   FALSE    0.053      0.000       0.000
CrudeVsCrude  Boundrange   4    FALSE    0.005      0.000       0.000
CrudeVsCrude  Boundrange   6    FALSE    0.008      0.000       0.000
CrudeVsCrude  Boundrange   8    FALSE    0.008      0.000       0.000
CrudeVsCrude  Boundrange   10   FALSE    0.009      0.000       0.000

Table 2 Kernel MMD results testing Reuters “crude” and “acq” data sets under different string 
kernels 

Data Set      Type         Arg  H0 rej.  AsymBound  1st moment  3rd moment
CrudeVsAcq    Exponential  -    TRUE     0.011      0.190       0.015
CrudeVsAcq    Constant     -    TRUE     0.032      0.304       0.006
CrudeVsAcq    Spectrum     4    TRUE     0.034      0.415       0.101
CrudeVsAcq    Spectrum     6    TRUE     0.041      0.406       0.076
CrudeVsAcq    Spectrum     8    TRUE     0.038      0.376       0.046
CrudeVsAcq    Spectrum     10   TRUE     0.042      0.356       0.029
CrudeVsAcq    Boundrange   4    TRUE     0.006      0.150       0.013
CrudeVsAcq    Boundrange   6    TRUE     0.008      0.173       0.016
CrudeVsAcq    Boundrange   8    TRUE     0.008      0.187       0.018
CrudeVsAcq    Boundrange   10   TRUE     0.010      0.196       0.019

Table 3 KMMD results testing Reuters “crude” data set against SpamAssassin mail corpus under 
different string kernels 

Data Set      Type         Arg  H0 rej.  AsymBound  1st moment  3rd moment
AcqVsSpam     Exponential  -    TRUE     0.010      0.392       0.138
AcqVsSpam     Constant     -    TRUE     0.012      0.288       0.047
AcqVsSpam     Spectrum     4    TRUE     0.017      0.514       0.232
AcqVsSpam     Spectrum     6    TRUE     0.017      0.448       0.164
AcqVsSpam     Spectrum     8    TRUE     0.013      0.376       0.103
AcqVsSpam     Spectrum     10   TRUE     0.014      0.336       0.074
AcqVsSpam     Boundrange   4    TRUE     0.011      0.387       0.139
AcqVsSpam     Boundrange   6    TRUE     0.011      0.403       0.150
AcqVsSpam     Boundrange   8    TRUE     0.011      0.408       0.153
AcqVsSpam     Boundrange   10   TRUE     0.012      0.409       0.153

Table 4 Kernel MMD results testing Reuters “crude” data set against Wizard of Oz books under 
different string kernels 

Data Set      Type         Arg  H0 rej.  AsymBound  1st moment  3rd moment
CrudeVsOz     Exponential  -    TRUE     0.023      0.275       0.060
CrudeVsOz     Constant     -    TRUE     0.097      0.427       0.010
CrudeVsOz     Spectrum     4    TRUE     0.120      0.747       0.485
CrudeVsOz     Spectrum     6    TRUE     0.161      0.869       0.640
CrudeVsOz     Spectrum     8    TRUE     0.156      0.783       0.466
CrudeVsOz     Spectrum     10   TRUE     0.134      0.691       0.309
CrudeVsOz     Boundrange   4    TRUE     0.017      0.253       0.057
CrudeVsOz     Boundrange   6    TRUE     0.017      0.270       0.063
CrudeVsOz     Boundrange   8    TRUE     0.018      0.280       0.066
CrudeVsOz     Boundrange   10   TRUE     0.018      0.286       0.067

Table 5 MMD using term-document matrices along different weightings (term frequency, term 
frequency inverse document frequency, and binary) 

Data Set      Type    Arg     H0 rej.  AsymBound      1st moment  3rd moment
CrudeVsCrude  Linear  tf      FALSE    37.282         0.000       0.000
CrudeVsCrude  Linear  tf-idf  FALSE    208.305        0.000       0.000
CrudeVsCrude  Linear  bin     FALSE    6.739          0.000       0.000
CrudeVsAcq    Linear  tf      FALSE    39.918         9.117       51.368
CrudeVsAcq    Linear  bin     FALSE    5.163          3.807       7.389
AcqVsSpam     Linear  tf      FALSE    247.393        23.519      387.846
AcqVsSpam     Linear  bin     TRUE     1.463          4.605       17.574
CrudeVsOz     Linear  tf      FALSE    2,206,959.822  3,986.405   15,818,506.600
CrudeVsOz     Linear  bin     FALSE    326.956        49.435      2,253.067

An important observation is that our good results clearly depend on the use 
of string kernels. We also implemented the MMD tests utilizing standard term- 
document matrices (see Table 5) leading to far inferior results compared to string 
kernels. 

We also conducted some experiments on using MMD for authorship attribution 
on the Wizard of Oz data set. However, we were not able to clearly distinguish 
the styles of the two authors Baum and Thompson, especially for books 
whose authorship has been unclear to literary experts for decades. Kernel 
Principal Component Analysis plots (see Figs. 1 and 2) confirm that the writing 
styles are not easily distinguishable and that authorship attribution for the Wizard of Oz 
is a very hard problem in general (Binongo, 2003). 

Fig. 1 Kernel Principal component plot for five Oz books using 500 line chunks on two 
principal components 




Fig. 2 Kernel Principal component plot for five Oz books using 100 line chunks on two principal 
components 






Running Time 

For our experiments we ran our tests on a high-performance cluster (multi-core 
processors at each processing node, 8-16 GB RAM depending on the cluster node) 
resulting in a running time of only a few minutes for all of our data sets. The tests 
also run very fast on contemporary desktop or workstation computers. Single tests 
run within seconds for normal-sized data sets. 



6 Conclusion 

We presented a new way of applying the two-sample test to text data which can be 
very useful for tagging, authorship attribution, stylometry, and linguistic forensics. 
We also presented an efficient implementation of the algorithm in the kernlab 
package and a new set of fast string kernels. Using these kernels we showed that the 
algorithm performs strongly on a number of texts. 



Linear Coding of Non-linear Hierarchies: 
Revitalization of an Ancient Classification 
Method 



Wiebke Petersen 



Abstract The article treats the problem of forcing entities into a linear order which 
could be more naturally organized in a non-linear hierarchy (e.g., books in a library, 
products in a warehouse or store, . . . ). The key idea is to apply a technique for the 
linear coding of non-linear hierarchies which was developed by the ancient 
grammarian Pāṇini for the concise representation of sound classes. The article 
briefly introduces Pāṇini’s technique and discusses a general theorem stating under 
which condition his technique can be applied. 

Keywords Classification • Hierarchy • Indian grammar theory • Pāṇini. 



1 Introduction 

1.1 Why Are Linear Codings Desirable? 

There are several situations in daily life where one is confronted with the problem 
of being forced to order things linearly although they could be organized in a non- 
linear hierarchy more naturally. For example, due to the one-dimensional nature of 
book shelves, books in a library or a bookstore have to be placed next to each other 
in a linear order. One of the simplest solutions to this problem would be to order 
the books strictly with respect to their authors’ names and to ignore their thematic 
relationships. Such an arrangement forces a user of a library, who is usually inter- 
ested in literature on a special subject, to cover long distances while collecting the 
required books. Therefore, librarians normally choose a mixed strategy: books are 
classified according to their thematic subject and within each class they are ordered 
alphabetically with respect to their authors’ names. International standards for 
subject classification usually claim that the classification has a tree structure, i.e., each 
thematic field is partitioned into disjoint subfields which are further subdivided into 
disjoint sub-subfields and so forth (e.g., UDC: “Universal Decimal Classification”). 
However, there will always be books which do not fit neatly into one single most 
specific class. 

Similar problems occur in many types of stores, too: food or toys are, like books, 
commonly presented on shelves, and clothes are hanging next to each other on racks. 
The question of whether honey should be placed in the breakfast or in the baking 
department, or whether clothes should be arranged by their color, by the season in 
which they are typically worn or their type (trousers, jackets, . . . ) is comparable to 
the book-placing problem. In warehouses it is beneficial to place the products such 
that those which are often ordered together are within striking distance in order 
to minimize the length of the course which has to be covered while carrying out 
an order. In this setting, the thematic classification becomes secondary since the 
“classes” are dynamically defined by the consumers’ orders. 

1.2 Panini’s Sivasutra-Technique 

The problem of linearly ordering entities which bear a complex non-linear relation- 
ship to each other is old and predates common product organizing problems. In 
spoken language everything has to be expressed linearly since language is linear by 
nature. In ancient India, people were very aware of this problem since their culture 
was based on an oral tradition in which script was mainly reserved for profane tasks like 
trading or administration. Since any text which was considered worth preserving 
was taught via endless recitations, keeping texts as concise as possible was 
desirable. The striving for conciseness is especially noticeable in grammar, for 
which many techniques to improve the compactness of the grammatical descriptions 
were invented. Grammar was regarded as the śāstrāṇāṃ śāstram “science of 
sciences” since it aimed at the preservation of the Vedas, the holy scriptures, of 
which the oldest parts date to around 1200 BC (cf. Staal, 1982). 

The culmination point of ancient Indian grammar was Panini’s Sanskrit grammar 
(Böhtlingk, 1887), which dates to circa 350 BC. Its main part consists of about 
4,000 rules, many of them phonological rules which describe the complex system of 
Sandhi. Sandhi processes are regular phonological processes which are triggered by 
the junction of two words or morphemes;² they are very common in Sanskrit. Phonological 
rules are typically of the form “sounds of class A are replaced by sounds of 
class B if they are preceded by sounds of class C and followed by sounds of class 
D”.³ Since it is not economical to enumerate for each single rule all sounds which 



¹ http://www.udcc.org/, http://www.udc-online.com/. 

² E.g., the regular alternation between a and an of the indefinite article in English is a Sandhi 
phenomenon. 

³ In modern phonology such a rule is denoted as 

$A \rightarrow B \,/\, C \,\_\, D$. (1) 

Since Panini’s grammar was designed for oral tradition, he could not make use of visual symbols 
(like arrows, slashes, . . . ) to indicate the role of the sound classes in a rule. He takes case 
अ इ उ ण् । ऋ ऌ क् । ए ओ ङ् । ऐ औ च् । ह य व र ट् । ल ण् । ञ म ङ ण न म् । झ भ ञ् । (I) 
घ ढ ध ष् । ज ब ग ड द श् । ख फ छ ठ थ च ट त व् । क प य् । श ष स र् । ह ल् । 

a-i-uṇ ṛ-ḷk e-oṅ ai-auc hayavaraṭ laṇ ñamaṅaṇanam jhabhañ (II) 
ghaḍhadhaṣ jabagaḍadaś khaphachaṭhathacaṭatav kapay śaṣasar hal 

a i u M₁ | ṛ ḷ M₂ | e o M₃ | ai au M₄ | h y v r M₅ | l M₆ | ñ m ṅ ṇ n M₇ | jh bh M₈ | (III) 
gh ḍh dh M₉ | j b g ḍ d M₁₀ | kh ph ch ṭh th c ṭ t M₁₁ | k p M₁₂ | ś ṣ s M₁₃ | h M₁₄ 

Fig. 1 Panini’s Sivasutras (I: Devanagari script; II: Latin transcription; III: Analysis - the 
syllable-building vowels are left out and the meta-linguistical consonants marking the end of a 
sutra are replaced by neutral markers Mₖ) 



are involved in it, an appropriate phonological description must include a method to 
denote sound classes. The method should be such that addressing a natural phonological 
class becomes easier than addressing an arbitrary set of sounds. In modern 
phonology one often chooses a set of binary phonetic features like [±consonantal] 
or [±voiced] in order to define the relevant sound classes. This approach necessarily 
involves the problem of choosing and naming features and the danger of defining 
ad-hoc features. However, Panini’s method of addressing the relevant sound classes 
evades this problem. 

The first 14 sutras of Panini’s Sanskrit grammar are called Sivasutras and are quoted 
in Fig. 1. Each sutra consists of a sequence of sounds ending in a consonant. This 
last consonant of each sutra is used meta-linguistically as a marker to indicate the 
end of a sutra. As the system behind the naming of the markers is unknown (cf. 
Misra, 1966), we have replaced them in Fig. 1 (III) by neutral markers M₁ up to 
M₁₄. Together the Sivasutras define a linear order on the sounds of Sanskrit. The 
order is such that each class of sounds on which a phonological rule of Panini’s 
grammar operates forms an interval which ends immediately before a marker element. 
As a result, Panini could use pairs consisting of a sound and a marker element 
in order to designate the sound classes in his grammar. Such a pair denotes the 
continuous sequence of sounds in the interval between the sound and the marker. E.g., 
the pair iM₂ denotes the class {i, u, ṛ, ḷ}. 
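
The way a sound-marker pair picks out an interval can be sketched in a few lines of Python. This is only an illustration: the list `SUTRAS` follows the analysis in Fig. 1 (III), while the marker names `M1` to `M14`, the transliteration and the function name `expand` are assumptions made for this sketch. When a sound occurs twice (as h does), the sketch starts at its first occurrence, which is likewise an assumption.

```python
# Sivasutras in analysed form: the sounds of each sutra plus its neutral marker.
# Marker names and transliteration are illustrative choices.
SUTRAS = [
    (["a", "i", "u"], "M1"),
    (["ṛ", "ḷ"], "M2"),
    (["e", "o"], "M3"),
    (["ai", "au"], "M4"),
    (["h", "y", "v", "r"], "M5"),
    (["l"], "M6"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "M7"),
    (["jh", "bh"], "M8"),
    (["gh", "ḍh", "dh"], "M9"),
    (["j", "b", "g", "ḍ", "d"], "M10"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "M11"),
    (["k", "p"], "M12"),
    (["ś", "ṣ", "s"], "M13"),
    (["h"], "M14"),
]


def expand(sound, marker):
    """Return the interval of sounds from `sound` up to (excluding) `marker`."""
    result, collecting = [], False
    for sounds, sutra_marker in SUTRAS:
        for s in sounds:
            if s == sound:
                collecting = True  # start at the first occurrence of the sound
            if collecting:
                result.append(s)
        if collecting and sutra_marker == marker:
            return result
    raise ValueError(f"no interval from {sound!r} to {marker!r}")


print(expand("i", "M2"))  # ['i', 'u', 'ṛ', 'ḷ']
```

Markers never enter the result, since they only delimit the list. A pair whose sound lies behind its marker raises an error because no such interval exists.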

The question whether Panini arranged the sounds in the Sivasutras in an optimal 
way and especially whether the double occurrence of the sound h (in the 5th and in 
the 14th sutra) is necessary has been widely discussed (cf. Böhtlingk, 1887; Staal, 
1962; Cardona, 1969; Kiparsky, 1991). In Petersen (2004) it was proven that 



suffixes instead which he uses meta-linguistically in order to mark the role a class plays in a rule. 
In Paninian style rule (1) becomes 

A + genitive, B + nominative, C + ablative, D + locative. (2) 
there is no shorter solution than the Sivasutras to the problem of ordering the sounds 
of Sanskrit in a linear list interrupted by markers, with as few repeated sounds as 
possible, such that each phonological class which is denoted by a sound-marker pair 
in Panini’s grammar can be represented by such a pair with respect to the list.⁴ This 
shows that the double occurrence of h is not superfluous and that Panini used a 
minimal number of markers in the Sivasutras. 



2 Linear Coding of Non-linear Hierarchies: Generalizing 
Panini’s Sivasutra-Technique 



In Sect. 1.1 we have argued that there are several situations in which it is required to 
force entities into a linear order although it would be more natural to organize them 
in a non-linear hierarchy. The aim of the present section is to show that Panini’s 
Sivasutra-technique, which has been introduced in Sect. 1.2, may offer a solution to 
the mentioned problem in many situations. 



2.1 S-Orders and S-Sortability: Formal Foundations 

All ordering problems mentioned in Sect. 1.1 are based on a common problem: 

Problem 1. Given a set of classes of entities (no matter on what aspects the classi- 
fication is based) order the entities in a linear order such that each single class forms 
a continuous interval with respect to that order. 

Take for example the problem of ordering books in a library. It would be favor- 
able to order the books linearly on the bookshelves such that all the books belonging 
to one thematic subfield are placed next to each other on the shelves without having 
to add additional copies of a book into the order. 

Panini solved Problem 1 with his Sivasutras in a concrete case: The Sivasutras 
define a linear order on the set of sounds in Sanskrit (with one sound occurring 
twice) in which each class of sounds required in his grammar forms a continuous 
interval.⁵ In order to solve concrete instances of Problem 1, one can do without 
Panini’s special technique of interrupting the order by marker elements such 
that each class interval ends immediately before a marker. In tribute to Panini’s 
Sivasutras we call a linear order which solves an instance of Problem 1 a Sivasutra-order 
or, for short, S-order. A set of classes is said to be S-sortable without duplications 



⁴ Actually, the Sivasutras are one of nearly 12,000,000 arrangements which are equal in length 
(Petersen, 2008). 

⁵ Actually, for the denotation of some sound classes Panini used different techniques in his 
grammar (for details see Petersen, 2008, 2009). 
if it forms a solvable instance of Problem 1, i.e., if a corresponding S-order exists. 
Obviously, not every set of classes is S-sortable without duplications. For instance, 
even the Sivasutras do not define an S-order as they contain one sound twice. By 
proving that the double occurrence of h is unavoidable it has been shown that no S- 
order exists for the set of sound classes required by Panini’s grammar (cf. Petersen, 
2004). However, it should be clear that by a clever duplication of enough elements 
each set of classes can be S-sorted. 

The following definition summarizes formally the terminology which we will 
use hereinafter in order to generalize and apply Panini’s Sivasutra-technique: 

Definition 1. Given a base set $A$ and a set of subsets $\Phi$ with $\bigcup \Phi = A$, a linear 
order $<$ on $A$ is called an S-order of $(A, \Phi)$ if and only if the elements of each set 
$\varphi \in \Phi$ form an interval in $(A, <)$. 

Furthermore, $(A, \Phi)$ is said to be S-sortable without duplications if and only if 
there exists an S-order $(A, <)$ of $(A, \Phi)$. 

Two simple examples serve us through the rest of the paper as illustrations: 

Example 1. Given the base set $A = \{a,b,c,d,e,f,g,h,i\}$ and the set of classes 
$\Phi = \{\{d,e\},\{a,b\},\{b,c,d,f,g,h,i\},\{f,i\},\{c,d,e,f,g,h,i\},\{g,h\}\}$, $(A, \Phi)$ is 
S-sortable without duplications and $a < b < c < g < h < f < i < d < e$ is an 
S-order of $(A, \Phi)$. 

Example 2. Given the base set $A = \{a,b,c,d,e,f\}$ and the set of classes 
$\Phi = \{\{d,e\},\{a,b\},\{b,c,d\},\{b,c,d,f\}\}$, $(A, \Phi)$ is not S-sortable without duplications. 

Example 2 is not S-sortable without duplications since $\{b,c,d\} \in \Phi$ demands 
that no element of $A \setminus \{b,c,d\}$ may stand between any two elements of $\{b,c,d\}$. 
Furthermore, from $\{d,e\} \in \Phi$ and $\{a,b\} \in \Phi$ it follows that either 
$a < b < c < d < e$ or $e < d < c < b < a$ is true. But this is impossible since it contradicts 
$\{b,c,d,f\} \in \Phi$. 
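
Definition 1 translates directly into a small check. The following Python sketch is an illustration, not the construction developed in this article; the names `is_s_order`, `EX1` and `EX2` are assumptions. It tests whether every class occupies consecutive positions in a given order, and it confirms by exhaustive search (feasible only for very small base sets) that Example 2 has no S-order.

```python
from itertools import permutations


def is_s_order(order, classes):
    """True if every class occupies consecutive positions in `order`."""
    position = {element: i for i, element in enumerate(order)}
    for cls in classes:
        indices = [position[element] for element in cls]
        # An interval of k elements spans exactly k consecutive positions.
        if max(indices) - min(indices) + 1 != len(indices):
            return False
    return True


# The sets of classes of Example 1 and Example 2.
EX1 = [set("de"), set("ab"), set("bcdfghi"), set("fi"), set("cdefghi"), set("gh")]
EX2 = [set("de"), set("ab"), set("bcd"), set("bcdf")]

# The order quoted in Example 1 is an S-order.
print(is_s_order("abcghfide", EX1))  # True

# None of the 6! = 720 linear orders of Example 2 is an S-order.
print(any(is_s_order(p, EX2) for p in permutations("abcdef")))  # False
```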

In the following, we will show how an S-order for a set of classes which is S- 
sortable without duplications can be constructed. Out of the construction process a 
condition for S-sortability can be derived. This condition is such that it can also help 
to identify those elements which must be duplicated in the case of a set of classes 
which is not S-sortable without duplications. 



2.2 Constructing S-Orders 

In the case of a set of classes which is S-sortable without duplications its S-orders 
can be read off from its enlarged concept lattice. The term concept lattice is taken 
from Formal Concept Analysis (FCA), a mathematical theory for the analysis 
of data (cf. Ganter & Wille, 1999). We do not need to develop the whole apparatus of 
FCA; it is sufficient to illustrate what we understand by the concept lattice of a set 
Fig. 2 Concept lattice of $(A, \Phi)$ from Example 1 




Fig. 3 Concept lattice of $(A, \Phi)$ from Example 2 



of classes by an example: Given the base set $A$ and the set of classes $\Phi$ from Example 1, 
the concept lattice of $(A, \Phi)$ is given in Fig. 2. It is constructed as follows: All 
elements of $\Phi$ and all possible intersections of elements of $\Phi$ are ordered by the 
set-inclusion relation such that subsets are placed above their supersets. Formally, Fig. 2 
shows the Hasse-diagram of the ordered set $(\{A\} \cup \{\Phi' \mid \Phi' = \bigcap \Psi \text{ with } \Psi \subseteq \Phi\}, \supseteq)$.⁶ 
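
The construction just described can be sketched as follows. The Python code is illustrative only; the function names `intersection_closure` and `hasse_edges` are assumptions. It closes the set of classes under intersection, adds the base set, and lists the neighbor pairs that form the edges of the Hasse-diagram.

```python
from itertools import combinations


def intersection_closure(base, classes):
    """The base set plus all intersections of classes, as frozensets."""
    closed = {frozenset(base)} | {frozenset(c) for c in classes}
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that new sets can be added safely.
        for x, y in combinations(list(closed), 2):
            z = x & y
            if z not in closed:
                closed.add(z)
                changed = True
    return closed


def hasse_edges(nodes):
    """Pairs (small, big) of direct neighbors with respect to set inclusion."""
    return [
        (small, big)
        for small in nodes
        for big in nodes
        if small < big and not any(small < mid < big for mid in nodes)
    ]


EX1_BASE = set("abcdefghi")
EX1 = [set("de"), set("ab"), set("bcdfghi"), set("fi"), set("cdefghi"), set("gh")]

nodes = intersection_closure(EX1_BASE, EX1)
edges = hasse_edges(nodes)
print(len(nodes))  # 11 nodes, including the empty set and the base set
```

For Example 1 the closure adds four sets to the six classes and the base set: the empty set, {d}, {b} and {c, d, f, g, h, i}.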

In Fig. 2, the corresponding set is written below each node. However, it 
is not necessary to label each node by its corresponding set since it is sufficient to 
attach each element of the base set A to the node which corresponds to the smallest 
set which contains the element. The result of this more economical labeling method 
is shown in Fig. 2 by the labels above the nodes. The set corresponding to a node 
can be recovered from the economical labels by collecting all labels attached to nodes 
which can be reached by moving upwards in the graph. Hence, from now on, solely 
the economical labels will be shown in figures of concept lattices, as in Fig. 3, which 
shows the concept lattice for Example 2. 

Although it would be possible to read off the S-orders for Example 1 from the 
concept lattice in Fig. 2 (cf. Petersen, 2004, 2008), it is easier to switch to the con- 
cept lattice of the enlarged set of classes. Enlarging the set of classes means adding 
each element of the base set as a singleton set to the set of classes, e.g., in the case of 



⁶ The Hasse-diagram of a partially ordered set is the directed graph whose vertices are the elements 
of the set and whose edges correspond to the upper neighbor relation determined by the partial 
order. 
Fig. 4 Enlarged concept lattice for Example 1 



Example 1 the set of classes has to be enlarged by the classes {a}, {b}, {c}, . . . , {i}. 

The enlarged concept lattice corresponding to Example 1 is shown in Fig. 4. 

The following theorem states the connection between S-orders and concept 
lattices of enlarged sets of classes: 

Theorem 1. A set of classes $(A, \Phi)$ is S-sortable without duplications if and only if 
a plane drawing of the Hasse-diagram of the concept lattice of the enlarged set of 
classes $(A, \tilde{\Phi})$ exists ($\tilde{\Phi} = \Phi \cup \{\{a\} \mid a \in A\}$).⁷ 

The full proof of this theorem is given in Petersen (2008) and sketched in 
Petersen (2009). It follows immediately from the definition of our concept lattices 
that whenever the Hasse-diagram of an enlarged concept lattice can be drawn with- 
out intersecting edges then an S-order without duplications exists: Concept lattices 
order sets by set inclusion; this ensures that the labels belonging to the elements of 
one class out of a set of classes form an interval in the sequence defined by the left- 
to-right order of the labels in a plane drawing of the Hasse-diagram of the enlarged 
concept lattice. It follows that this left-to-right order defines an S-order without 
duplications of the set of classes. For example, the plane Hasse-diagram in Fig. 4 
defines the S-order $e < d < c < i < f < h < g < b < a$ for the set of classes 
from Example 1. 

The proof of the converse statement, i.e., that the existence of an S-order implies 
the existence of a plane drawing of the Hasse-diagram, was first given in Petersen 
(2004). The proof involves an explicit construction of a plane drawing of the Hasse- 
diagram of the enlarged concept lattice for any S-order of any S-sortable set of 
classes. The construction guarantees that the left-to-right order of the labels equals 
the original S-order. 



⁷ A drawing of a Hasse-diagram is said to be plane if it shows no intersecting edges. 
Fig. 5 Illustration of the distinct plane drawings of the enlarged concept lattice for Example 1 
resulting in different S-orders 



However, the construction method does not deterministically result in one S- 
order, since usually several plane drawings exist for the concept lattice of an 
enlarged set of classes. In fact, the proof of the theorem above implies that for every 
S-order there exists a plane drawing of the concept lattice from which it can be read 
off. Figure 5 illustrates the distinct plane drawings of the enlarged concept lattice 
for Example 1 which result in different S-orders, as for example: 



$e < d < c < i < f < h < g < b < a$ 
$e < d < i < f < c < h < g < b < a$ (3 variations) 
$e < d < c < f < i < h < g < b < a$ (2 variations) 
$e < d < c < i < f < g < h < b < a$ (2 variations) 
$e < d < c < h < g < i < f < b < a$ (2 variations) 
$a < b < g < h < f < i < c < d < e$ (2 variations) 



Altogether, Example 1 has 48 (= 3 × 2 × 2 × 2 × 2) distinct solutions, i.e., distinct 
S-orders. 
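
The count of 48 can be cross-checked by brute force. The following Python sketch does not use the lattice-based construction of this section. Instead it enumerates all 9! = 362,880 linear orders of the base set of Example 1 and keeps those in which every class forms an interval; the function name `all_s_orders` is an assumption made for this illustration.

```python
from itertools import permutations


def all_s_orders(base, classes):
    """Enumerate every linear order of `base` in which each class is an interval."""
    found = []
    for perm in permutations(sorted(base)):
        position = {element: i for i, element in enumerate(perm)}
        if all(
            max(position[x] for x in cls) - min(position[x] for x in cls) + 1 == len(cls)
            for cls in classes
        ):
            found.append("".join(perm))
    return found


EX1 = [set("de"), set("ab"), set("bcdfghi"), set("fi"), set("cdefghi"), set("gh")]
orders = all_s_orders("abcdefghi", EX1)
print(len(orders))  # 48
```

Such an exhaustive search grows factorially with the size of the base set, so it serves only as a check for toy examples.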



2.3 The Problem of Identifying Elements for Duplication 

Theoretically, Theorem 1 enables us to determine for each set of classes whether it is 
S-sortable without duplications or not. Since in the case of an S-sortable set of classes 
the proof of the theorem even establishes a method to construct a concrete S-order, 
Theorem 1 solves Problem 1 in theory. In practice, however, deciding whether the 
Fig. 6 Enlarged concept lattices for Example 2 



Hasse-diagram of a concept lattice can be drawn without intersecting edges is not 
trivial. For smaller examples like Example 2 a close inspection of the Hasse-diagram 
of the enlarged concept lattice given in Fig. 6 is sufficient to see that it is impossible 
to draw this Hasse-diagram without intersecting edges. This proves that the set of 
classes in Example 2 is not S-sortable. However, for more complex sets of classes, 
like the one given by the sound classes used in Panini’s Sanskrit grammar, the investigation 
of the Hasse-diagrams becomes more cumbersome. Other necessary as well as 
sufficient conditions for S-sortability which are easier to verify have been developed in 
Petersen (2008), but due to space limits they cannot be presented here in detail. 
The most useful condition is based on the bipartiteness of so-called 
Ferrers-graphs (cf. Ganter & Wille, 1999; Petersen, 2008, 2009; Zschalig, 2007). 
Whether a graph is bipartite can be checked algorithmically; hence, this condition 
opens up a way of investigating more complex sets of classes automatically. 
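
How the Ferrers-graph of a set of classes is built is not described here, so the following Python sketch covers only the final, generic step: testing an arbitrary undirected graph for bipartiteness by two-coloring it with a breadth-first search. The adjacency-dictionary input format and the function name `is_bipartite` are assumptions made for this illustration.

```python
from collections import deque


def is_bipartite(adjacency):
    """Two-color an undirected graph given as {vertex: iterable of neighbors}."""
    color = {}
    for start in adjacency:
        if start in color:
            continue  # vertex belongs to a component that is already colored
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency.get(u, ()):
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # an odd cycle has been found
    return True


square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(is_bipartite(square), is_bipartite(triangle))  # True False
```

The check visits every vertex and edge once, so it remains feasible for graphs far larger than those that can be inspected by eye.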

The problem of identifying the best candidates for duplication is intricate, too. 
In order to construct an optimal S-order for a set of classes which is not S-sortable 
one has to identify those elements whose duplication leads to the “shortest” S-order, 
i.e., the aim is to duplicate as few elements as possible. In the case of Example 2 it 
is sufficient to duplicate one element, for example b. Duplicating element b 
means adding a copy $b'$ to the base set A and changing some instances of b in the 
set of subsets to $b'$. One of the optimal solutions to Example 2 is to duplicate b 
such that the new base set becomes $\{a, b, b', c, d, e, f\}$ and the new set of subsets 
becomes $\{\{d,e\},\{a,b'\},\{b,c,d\},\{b,c,d,f\}\}$. An S-order of the new set of classes 
with one duplication is for instance 

$f < b < c < d < e < a < b'$. 
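
For very small examples, candidates for a single duplication can be found by brute force. The sketch below is not one of the methods of Petersen (2008); it simply tries every element, moves a copy of it into every non-empty proper subset of the classes that contain it, and tests the result exhaustively for S-sortability. The names `has_s_order` and `single_duplications` are assumptions made for this illustration.

```python
from itertools import combinations, permutations


def has_s_order(base, classes):
    """Exhaustively test whether some linear order turns every class into an interval."""
    for perm in permutations(sorted(base)):
        position = {element: i for i, element in enumerate(perm)}
        if all(
            max(position[x] for x in cls) - min(position[x] for x in cls) + 1 == len(cls)
            for cls in classes
        ):
            return True
    return False


def single_duplications(base, classes):
    """Map each element whose duplication suffices to one resulting set of classes."""
    found = {}
    for x in sorted(base):
        copy = x + "'"
        holders = [i for i, cls in enumerate(classes) if x in cls]
        # Replace x by its copy in a non-empty proper subset of its classes.
        for k in range(1, len(holders)):
            for moved in combinations(holders, k):
                new_classes = [
                    (cls - {x}) | {copy} if i in moved else set(cls)
                    for i, cls in enumerate(classes)
                ]
                if has_s_order(set(base) | {copy}, new_classes):
                    found.setdefault(x, new_classes)
    return found


EX2 = [set("de"), set("ab"), set("bcd"), set("bcdf")]
candidates = single_duplications(set("abcdef"), EX2)
print("b" in candidates)  # True: duplicating b suffices, as stated in the text
```

Elements that occur in only one class are skipped automatically, because moving a copy into their single class would merely rename them.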

In Petersen (2008) a whole battery of methods for the identification of elements 
which are good candidates for duplication is developed. Although it can still be hard 
to identify good candidate elements for duplication, the problem becomes much 
less challenging if one does not ask for absolutely minimal but only for nearly minimal 
S-orders. 

For instance, in situations as described in Sect. 1.1, where books or products have 
to be forced into a linear order, adding additional copies is expensive and space 
consuming, but not impossible. It can be assumed that by applying S-orders fewer 
books or products have to be placed at two distinct regions than by applying standard 
mono-hierarchical classification methods. 
Abstract Automatically generating bilingual dictionaries from parallel, manually 
translated texts is a well established technique that works well in practice. However, 
parallel texts are a scarce resource. Therefore, it is desirable also to be able to gener- 
ate dictionaries from pairs of comparable monolingual corpora. For most languages, 
such corpora are much easier to acquire, and often in considerably larger quantities. 
In this paper we present the implementation of an algorithm which exploits such 
corpora with good success. Based on the assumption that the co-occurrence pat- 
terns between different languages are related, it expands a small base lexicon. For 
improved performance, it also realizes a novel interlingua approach. That is, if cor- 
pora of more than two languages are available, the translations from one language to 
another can be determined not only directly, but also indirectly via a pivot language. 

Keywords Comparable corpora • Dictionary generation • Multilingual texts • Word 
translations 



1 Introduction 

Until some years ago, English was the primary language of the world wide 
web. However, with web contents expanding from initially mainly technical topics 
to topics of almost any aspect of life, there is a tendency for web publishers to adopt 
the native tongue of the intended audience. This leads to a significant increase of 
web pages written in languages other than English. 

This means for the English native speaker that it will be harder and harder to find 
the information in his mother tongue, a situation familiar to speakers of all other 
languages. As a consequence of the web becoming increasingly multilingual, as 
well as of globalization in general, the need for affordable dictionaries is growing. 
To be able to optimally exploit the information on the web, dictionaries between all 
language pairs would be desirable. However, with 6,800 living languages, of which 
600 exist in written form, this is not very realistic. But even if we consider only 
the 100 main languages, which cover 95% of world population, there are still 4,950 
possible language pairs (9,900 directions) requiring dictionaries. 

The need for dictionaries between a large number of language pairs makes self- 
learning systems an interesting option. Such systems are able to automatically 
extract raw versions of dictionaries from translated texts. However, the required 
parallel texts are a scarce resource.¹ Despite all efforts to mine parallel texts from 
pairs of monolingual corpora (Munteanu & Marcu, 2005; Wu & Fung, 2005), the 
required quantities of such data are not available for most language pairs (Rapp & 
Martin Vide, 2007).² 

This is why we propose a methodology for dictionary extraction directly operat- 
ing on monolingual corpora. As monolingual corpora are far easier to acquire than 
their bilingual counterparts, this should considerably diminish the data acquisition 
bottleneck. This is all the more true as in the case of monolingual corpora one corpus 
per language is usually sufficient, whereas with parallel corpora one corpus per language 
pair is required. Consequently, with parallel corpora the number of corpora required grows 
quadratically rather than linearly with the number of languages. 

The basic assumption underlying our approach is that across languages there is a 
correlation between the co-occurrence patterns of words that are mutual translations. 
If, for example, in language A two words co-occur more often than expected by 
chance, then their translated equivalents in language B should also co-occur more 
frequently than expected. In a feasibility study (Rapp, 1995) we showed that this 
assumption holds for English and German even in the case of unrelated texts. When 
comparing an English and a German co-occurrence matrix of corresponding words, 
we found a high correlation between the co-occurrence patterns of the two matrices 
when the rows and columns of both matrices were in corresponding word order, 
whereas the correlation was low when the rows and columns were in random order. 

The validity of this co-occurrence constraint is obvious for parallel corpora, but - 
as described above - it also holds for non-parallel corpora. It can be expected that 
this constraint will work best with parallel corpora, second-best with comparable 
corpora, and somewhat worse with unrelated corpora. Robustness is not a big issue 
in any of these cases. In contrast, when applying sentence alignment algorithms 
to parallel corpora, omissions, insertions, and transpositions of text segments have 
critical negative effects. However, the co-occurrence constraint when applied to 
comparable corpora is much weaker than the word-order constraint as used with 
parallel corpora. This is why larger corpora and well-chosen statistical methods are 
needed. 



¹ Examples are the parallel corpora derived from the proceedings of the European parliament 
(Armstrong, Kempen, McKelvie, Petitpierre, Rapp, et al., 1998; Koehn, 2005) and the JRC-Acquis corpus 
(Steinberger, Pouliquen, Widiger, Ignat, Erjavec, et al., 2006). 

² For an overview on the availability of parallel texts for various languages, see Mike Maxwell’s 
posting on the corpora mailing list of February 27, 2008, with subject line “quantities of publicly 
available parallel text”, archived at http://listserv.linguistlist.org/archives/corpora.html. 
The current paper can be seen as a continuation of our previous work (Rapp, 
1995, 1999). We present a novel algorithm and provide quantitative results for six 
language pairs rather than for just one. Related work has been conducted by Fung 
and Yee (1998), Fung and McKeown (1997), and Chiao, Sta, and Zweigenbaum 
(2004). By presupposing a lexicon of seed words, Fung and McKeown avoid the 
prohibitively expensive computational effort encountered by Rapp (1995). The 
method described here goes in the same direction. By assuming the existence of an 
initial lexicon we significantly reduce the search space: We only conduct a relatively 
small number of vector comparisons instead of considering a very large number of 
permutations concerning potential correspondences of word order. 

Another new feature of this work is that it explores the possibility of utilizing 
the dictionaries’ property of transitivity. What we mean by this is the following: 
If we have two dictionaries, one translating from language A to language B, the 
other from language B to language C, then we can also translate from A to C by 
using B as the pivot, interlingua or intermediate language. That is, the property of 
transitivity, although having some limitations due to ambiguity problems, can be 
exploited for the automatic generation of a raw dictionary with mappings from A 
to C. One might consider this unnecessary as our corpus-based approach already 
allows us to generate such a dictionary with even higher accuracy directly from the 
respective comparable corpora. 

However, this implies that we now have two ways to generate a dictionary for 
a particular language pair, which means that in principle we can validate one with 
the other. Furthermore, given several languages, there is not only one method to 
generate a transitivity-based dictionary for A to C, but there are several. This means 
that by increasing the number of languages we also increase the possibilities of 
mutual cross-validation. While this is still work in progress, we can present here an 
evaluation of the results that can be expected when constructing a dictionary using 
a single interlingua, and compare it with the results obtained without the use of an 
interlingua. Our evaluation gives exact quality measures for six language directions. 



2 Approach 

As mentioned above, we assume that across languages there is a strong correla- 
tion between the co-occurrences of words that are mutual translations. It is further 
assumed that there is a small dictionary available at the beginning, and that our aim 
is to expand this base lexicon. Using a corpus of the target language, we first com- 
pute a co-occurrence matrix whose rows are all word types occurring in the corpus 
and whose columns are all target words appearing in the base lexicon. We then apply 
an association measure on this co-occurrence matrix, namely the log-likelihood ratio 
(Dunning, 1993). Next, we select a word of the source language whose translation is 
to be determined. Using our source-language corpus, we compute a co-occurrence 
vector for this word, and we also apply the association measure to it. After this, we 
translate all known words in this vector to their corresponding form in the target 
language. This is done via the base lexicon. In the case of ambiguous words, we use 
the primary translation, i.e., the one that is listed first in the lexicon. Since our base 
lexicon is small, only some of the translations are known. All unknown words are 
discarded from the vector. The entries of the target language vector are then sorted 
according to their association strengths. We keep only the 30 strongest associations 
and eliminate all others. 
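
As an illustration of the association measure, the log-likelihood ratio of Dunning (1993) can be computed from a 2 × 2 contingency table of co-occurrence counts. The Python sketch below uses the common formulation G² = 2 Σ O ln(O/E). How the counts are collected (window size, counting unit) is not specified in this section, so the cell definitions in the docstring are assumptions.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    Assumed cell meanings: k11 = contexts containing both words,
    k12 = first word without the second, k21 = second word without
    the first, k22 = contexts containing neither word.
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


print(log_likelihood_ratio(10, 10, 10, 10))  # 0.0: the two words are independent
print(log_likelihood_ratio(10, 0, 0, 10) > 0)  # True: the two words always co-occur
```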

As a result, we now have a vector of the source language word that comprises 
those 30 of its top associations that could be translated using the base lexicon. Dur- 
ing the next step, for each word in the target language vocabulary (comprising all 
words of the target language corpus with a frequency of 100 or higher) the ranks 
of these 30 translations are determined, and the product of their ranks is computed. 
The word with the smallest value of the product is considered to be the translation 
of the source language word. 
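
The selection step can be sketched as follows. The Python code is a toy illustration under stated assumptions, not the implementation evaluated in this paper: the rank of a translated word is taken to be its position in the candidate's association vector sorted by decreasing strength, words missing from that vector receive the worst rank plus one, and all names and numbers in the example are made up.

```python
def best_translation(source_vector, base_lexicon, target_assoc, top_n=30):
    """Pick the target word with the smallest product of ranks."""
    # Translate the context words via the base lexicon; unknown words are dropped.
    translated = {}
    for word, strength in source_vector.items():
        if word in base_lexicon:
            target = base_lexicon[word]
            translated[target] = max(strength, translated.get(target, strength))
    # Keep only the top_n strongest associations.
    top = sorted(translated, key=translated.get, reverse=True)[:top_n]

    best_word, best_product = None, None
    for candidate, assoc in target_assoc.items():
        ranking = sorted(assoc, key=assoc.get, reverse=True)
        rank_of = {w: r for r, w in enumerate(ranking, start=1)}
        missing = len(ranking) + 1  # assumed rank for words absent from the vector
        product = 1
        for w in top:
            product *= rank_of.get(w, missing)
        if best_product is None or product < best_product:
            best_word, best_product = candidate, product
    return best_word


# Hypothetical association strengths for the German source word "Blume".
source_vector = {"Garten": 9.0, "rot": 7.5, "Duft": 6.0, "Vase": 5.0, "Auto": 0.4}
base_lexicon = {"Garten": "garden", "rot": "red", "Duft": "scent", "Auto": "car"}
target_assoc = {
    "flower": {"garden": 8.0, "red": 6.0, "scent": 5.5, "car": 0.2},
    "engine": {"car": 9.0, "red": 1.0, "garden": 0.3, "scent": 0.1},
    "tree": {"garden": 7.0, "car": 2.0, "red": 1.5, "scent": 1.0},
}
print(best_translation(source_vector, base_lexicon, target_assoc, top_n=3))  # flower
```

With `top_n=3` the rank products are 6 for flower, 24 for engine and 12 for tree. Because only ranks enter the product, a single extreme association value cannot dominate the decision.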

This algorithm turned out to be a significant improvement over the previous 
one described in Rapp (1999). It provides better accuracy and considerably higher 
robustness with regard to sampling errors. The reason for the improvement appears 
to be that outliers and function words, which may have a negative effect on the 
results, are usually not among the top 30 associations, hence they do not have any 
impact, at least not on the side of the source language. 

The exploration of transitivity (see Sect. 1) was conducted as follows: Using the 
improved algorithm we start by generating a dictionary that translates the test words 
from the source language to the interlingua. Next, we translate the resulting word 
list from the interlingua to the target language. Finally, the outcome is compared to 
our gold standard. As the interlingua approach is based on a two-stage process in which 
errors accumulate, the results can be expected to be worse than for direct translation. 
We nevertheless believe in the virtues of this approach as there are numerous ways 
of choosing the interlingua. 
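
The two-stage procedure amounts to composing two dictionaries and scoring the result against the gold standard. The following Python sketch uses made-up toy dictionaries; the function names `translate_via_pivot` and `accuracy` are assumptions. The example also shows how an ambiguous pivot word can introduce an error.

```python
def translate_via_pivot(source_to_pivot, pivot_to_target):
    """Compose two one-to-one dictionaries through the pivot language."""
    result = {}
    for source_word, pivot_word in source_to_pivot.items():
        if pivot_word in pivot_to_target:
            result[source_word] = pivot_to_target[pivot_word]
    return result


def accuracy(dictionary, gold):
    """Share of gold-standard entries that the dictionary reproduces."""
    correct = sum(1 for word, target in gold.items() if dictionary.get(word) == target)
    return correct / len(gold)


# Toy data: German -> English (pivot) -> French.
de_en = {"Hund": "dog", "Bank": "bank", "Haus": "house"}
en_fr = {"dog": "chien", "bank": "rive", "house": "maison"}
gold_de_fr = {"Hund": "chien", "Bank": "banque", "Haus": "maison"}

de_fr = translate_via_pivot(de_en, en_fr)
print(de_fr)
print(accuracy(de_fr, gold_de_fr))
```

Here the ambiguous pivot word "bank" leads to "rive" instead of "banque", so two of the three gold entries are reproduced.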



3 Language Resources 

Three languages were considered, namely English, French, and German, and all six 
language pairs that can be derived from these. To conduct the simulation, a number 
of resources were required: 

1. Monolingual corpora for each of the three languages 

2. A number of word equations English - French - German to be used as a gold 
standard for evaluating the results 

3. Small base dictionaries for each of the six language pairs 

For German, we used a corpus of 135 million words of the newspaper Frankfurter 
Allgemeine Zeitung (1993-1996). For English we relied on a corpus of 163 million 
words of The Guardian (1990-1994), while for French only a small set of newspaper 
corpora was available to us. This is why we acquired a corpus comprising 
the French version of Wikipedia and ABU - La Bibliothèque Universelle (together 
about 70 million words). For each corpus, a specific cleanup-program was written 
and applied. 

Since these corpora are relatively large, to save disk space and processing time, 
we decided to remove all function words from the texts. This was done on the 
basis of a stopword list of approximately 600 German, another list of about 200 
English, and a third list of about 500 French words. By eliminating function words 
we expected to lose little information: Function words are often highly ambiguous, 
and their co-occurrences are mostly caused by syntactic rather than semantic 
patterns. Since semantic patterns are more reliable than syntactic patterns across 
language families, we hoped that eliminating the function words would increase the 
generality of our method. 

Rapp (1999) used a list of 100 German test words together with their English 
translations as gold standard for testing the results. As this list is rather small, and 
as we also needed French translations, we decided to compile a larger trilingual list 
of test words. For this purpose, we used three editions of Collins Gem Dictionar- 
ies, which are small pocket dictionaries intended for everyday use. We started with 
the German-to-English part of the Collins Gem German Dictionary, which con- 
tains about 20,000 entries. For each German word, we considered only the primary 
English translation, i.e., the one that was listed first. Each of these we looked up 
in the Collins English-to-French dictionary, again only taking primary translations 
into account. Finally, in the same way we looked up the French words in the Collins 
French-to-German dictionary. This way we obtained a large table of word transla- 
tions comprising the following columns: German - English - French - German. We 
eliminated from this table all lines where the German words in the first and fourth 
column differed. From the remaining table of 1,079 words we eliminated the fourth 
column, as it had become redundant. The resulting list of trilingual word equations 
was used as the test set for our evaluations. 
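
The round-trip filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `primary` and `round_trip_test_set` and the toy dictionary entries are hypothetical and are not taken from the Collins Gem data; each dictionary is assumed to map a headword to a list of translations with the primary one first.

```python
def primary(dictionary, word):
    """Return the first-listed (primary) translation, or None if absent."""
    translations = dictionary.get(word)
    return translations[0] if translations else None


def round_trip_test_set(de_en, en_fr, fr_de):
    """Keep German words that survive German -> English -> French -> German."""
    equations = []
    for german in de_en:
        english = primary(de_en, german)
        french = primary(en_fr, english) if english else None
        back = primary(fr_de, french) if french else None
        # Keep the line only if the first and fourth column agree.
        if back == german:
            equations.append((german, english, french))
    return equations


# Toy dictionaries (hypothetical entries, for illustration only).
de_en = {"Haus": ["house", "home"], "Bank": ["bench", "bank"]}
en_fr = {"house": ["maison"], "bench": ["banc"]}
fr_de = {"maison": ["Haus"], "banc": ["Sitzbank"]}

test_set = round_trip_test_set(de_en, en_fr, fr_de)
print(test_set)
```

In this toy example the entry for Bank is dropped because the chain does not return to the original German word, which mirrors the elimination of lines whose first and fourth column differ.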

Note that in order to arrive at this test set we used only three of the six language 
pairs, and that the order in which we applied the dictionaries was more or less arbi- 
trary. We had tried other language pairs and other dictionary orders, with somewhat 
different outcomes. We finally decided to choose the current one, as our intention 
had been to arrive at a test set of about 1,000 items. 

The six base lexicons required by our algorithm were also derived from the 
Collins Gem Dictionaries. All multi-word entries were eliminated. Since it would 
not make sense to apply our method to words that are already in the base lexicon, 
we removed all dictionary entries belonging to the 1,079 test words in the source 
language of the respective language pair. 



4 Results 

Based on the algorithm and the corpora described above, we computed the translation
of each word of the test word list into the other two languages. For
co-occurrence counting, a window size of plus or minus two words around the
given word was assumed. As function words had been removed beforehand from
the corpora, and assuming that roughly every second word is a function word, this
corresponds to a window size of about plus or minus four words in the original text.
Since our algorithm requires relatively time consuming computations for each word 
in the target vocabulary, we decided to take into account only words with a cor- 
pus frequency of at least 100 in order to save processing time. As our corpora are 
rather large, this threshold leaves almost all common words in the vocabulary, while 
eliminating most misspelled words. 
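
As an illustration of the counting step, the following minimal sketch removes function words and then counts co-occurrences within a window of plus or minus two content words. The function name `cooccurrence_counts`, the toy sentence, and the tiny stopword list are hypothetical; applying the frequency threshold to both the given word and its neighbors is an assumption made here for simplicity, not a detail stated in the text.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, stopwords, window=2, min_freq=1):
    """Count co-occurrences within +/- `window` content words of each word."""
    # Drop function words first, so the window spans content words only.
    content = [t for t in tokens if t.lower() not in stopwords]
    freq = Counter(content)
    counts = defaultdict(Counter)
    for i, word in enumerate(content):
        if freq[word] < min_freq:
            continue  # skip rare words (the text uses a threshold of 100)
        for j in range(max(0, i - window), min(len(content), i + window + 1)):
            if j != i and freq[content[j]] >= min_freq:
                counts[word][content[j]] += 1
    return counts


# Toy sentence and stopword list (hypothetical, for illustration only).
tokens = "the dog chased the cat in the garden".split()
counts = cooccurrence_counts(tokens, stopwords={"the", "in"})
print(dict(counts["dog"]))
```

Because the function words are removed before the window is applied, a window of two content words reaches further in the original text, which is the effect described above.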

Table 1 gives an idea of the system’s performance. It shows the top ten computed
translations for the following six German words: Historie (history), Leibwächter
(bodyguard), Raumfähre (space shuttle), spirituell (spiritual), ukrainisch (Ukrainian),
and umdenken (rethink). The columns have the following meanings:

1. Rank of a potential translation

2. Corpus frequency of translation

3. Score assigned to translation (the larger the better)

4. Computed translation

Table 1 Top ten computed translations for six German words

Historie (history)                    Leibwächter (bodyguard)
 1  29,453   13.73  history            1     949   40.02  bodyguard
 2   4,997   12.87  literature         2   5,619   23.34  policeman
 3   4,758    8.74  historical         3   2,535    8.18  gunman
 4   2,670    0.67  essay              4  26,347    3.69  kill
 5   6,969    0.11  contemporary       5   9,180    2.92  guard
 6  18,909   -1.72  art                6     401   -0.56  bystander
 7  18,382   -2.81  modern             7     815   -1.24  POLICE
 8  15,728   -4.31  writing            8   8,503   -2.33  injured
 9   1,447   -5.52  photography        9   2,973   -3.23  stab
10   2,442   -5.53  narrative         10   1,876   -3.58  murderer

Raumfähre (space shuttle)             Spirituell (spiritual)
 1   1,259   46.20  shuttle            1   2,964   56.10  spiritual
 2     666   26.25  Nasa               2   1,380    8.34  Christianity
 3     473   25.95  astronaut          3   7,721    8.08  religious
 4     287   25.76  spacecraft         4   9,525    4.10  moral
 5   1,062   16.92  orbit              5   1,414    0.63  secular
 6  16,086   11.72  space              6   5,685    0.06  emotional
 7     525    9.50  manned             7   4,678   -1.04  religion
 8     125    7.69  cosmonaut          8   6,447   -1.49  intellectual
 9     254    5.24  mir                9   8,749   -2.25  belief
10   7,080    3.70  plane             10   8,863   -4.07  cultural

Ukrainisch (Ukrainian)                Umdenken (rethink)
 1   1,753   50.69  Ukrainian          1   1,119   20.76  rethink
 2  22,626   39.88  Russian            2     248   15.46  reassessment
 3   3,205   29.25  Ukraine            3  84,109   13.39  change
 4  34,572   23.63  Soviet             4  12,497   12.13  reform
 5     978   21.13  Lithuanian         5     236   10.00  reappraisal
 6   1,005   18.88  Kiev               6   9,220    9.97  improvement
 7  10,968   15.07  Gorbachev          7   5,212    9.48  implement
 8  10,209   14.51  Yeltsin            8   1,139    8.25  overhaul
 9  16,616   13.38  republic           9  13,550    7.89  unless
10     502   11.71  Latvian           10   9,807    7.88  immediate

The results for the other five language pairs, which due to space constraints can- 
not be shown here, are of roughly comparable quality. If we look at the table, we see 
that a correct translation is usually ranked first, and that typical associations follow. 
This behavior can be expected from our associationist approach. 

To get a better picture of the quality of the results, we also conducted a quan- 
titative evaluation. For all 1,079 test words we checked whether the predicted 
translation (first word in the ranked list) was identical to our expected translation 
(as taken from the word equations used as our gold standard). In the case of the lan- 
guage pair German to English this was true for 512 of the 1,079 test words, which 
corresponds to an accuracy of 47.5%. Note that this is a rather conservative assessment
of the quality, as our measure requires string identity and therefore has no
tolerance. For example, correct alternative translations (e.g., road instead of street
for Straße) or inflected forms of the expected translation are counted as mistakes.
The following table gives analogous results for all six language pairs: 



German → English    47.5%        English → German    35.7%
German → French     21.2%        French → German     21.7%
French → English    30.1%        English → French    34.9%
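
The exact-match measure behind these figures can be sketched as follows. The helper name `top1_accuracy` and the two toy entries are hypothetical; only the counts 512 and 1,079 are taken from the text.

```python
def top1_accuracy(ranked, gold):
    """Fraction of test words whose top-ranked candidate equals the gold entry."""
    hits = sum(
        1
        for word, expected in gold.items()
        if ranked.get(word) and ranked[word][0] == expected
    )
    return hits / len(gold)


# Toy data (hypothetical): a correct alternative ranked first counts as a miss.
gold = {"Straße": "street", "Historie": "history"}
ranked = {"Straße": ["road", "street"], "Historie": ["history", "literature"]}
toy_accuracy = top1_accuracy(ranked, gold)

# Figure reported for German to English: 512 hits out of 1,079 test words.
reported = round(100 * 512 / 1079, 1)
print(toy_accuracy, reported)
```

The toy case shows why the measure is conservative: a plausible translation in first position is still scored as an error when it is not string-identical to the gold standard.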



As the results vary quite a bit, the question arises how to explain the differences.
Here are some attempts: On the one hand, our corpus of French is considerably smaller
than our corpora of English and German (70 vs. about 150 million words) and of a
different genre (encyclopedia rather than newspaper). On the other hand, French
and German are highly inflectional languages, whereas English is not. So the risk of 
selecting an inflectional variant of the expected translation (which would be counted 
as incorrect) is lower in English. Another consideration concerns the degree of 
relatedness of two languages. Whereas French is a typical Romance language and 
German a typical Germanic language, English lies somewhere in between. From this 
point of view it can be expected that the language pairs involving English achieve 
the best results, which is confirmed by the table. 

With regard to the interlingua approach, the following table shows quantita- 
tive results for the six possible language triplets as obtained using the algorithm 
described in Sect. 2. Whereas the accuracies without an interlingua had been
between 21% and 48%, the figures here vary between 11% and 25%, i.e., they are
at about half of this level, which is clearly better than what could be expected in
the case of statistical independence. This gives rise to the hope that at some point
it may be possible to obtain significantly improved results by combining several 
dictionaries generated via different interlinguae. 
German → French → English    11.4%        English → French → German    13.5%
German → English → French    24.7%        French → English → German    16.2%
French → German → English    16.3%        English → German → French    13.4%
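
To make the comparison with statistical independence concrete, one can multiply the two direct accuracies along each path and compare the product with the observed interlingua accuracy. The sketch below does this with the figures from the two tables; the language codes and the function name `independent_estimate` are introduced here for illustration, and treating the product as the independence baseline is our reading of the argument rather than a formula given in the text.

```python
# Direct accuracies in percent, taken from the table of the six language pairs.
direct = {
    ("de", "en"): 47.5, ("en", "de"): 35.7,
    ("de", "fr"): 21.2, ("fr", "de"): 21.7,
    ("fr", "en"): 30.1, ("en", "fr"): 34.9,
}

# Observed accuracies in percent for source -> interlingua -> target.
observed = {
    ("de", "fr", "en"): 11.4, ("en", "fr", "de"): 13.5,
    ("de", "en", "fr"): 24.7, ("fr", "en", "de"): 16.2,
    ("fr", "de", "en"): 16.3, ("en", "de", "fr"): 13.4,
}


def independent_estimate(src, pivot, tgt):
    """Accuracy expected if both stages succeeded independently (in percent)."""
    return direct[(src, pivot)] * direct[(pivot, tgt)] / 100.0


for (src, pivot, tgt), acc in observed.items():
    est = independent_estimate(src, pivot, tgt)
    print(f"{src}->{pivot}->{tgt}: observed {acc:.1f}%, product {est:.1f}%")
```

For German via English to French, for instance, the product is about 16.6% while the observed accuracy is 24.7%; the observed value exceeds the product for all six triplets.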



5 Discussion and Future Work 

In this paper we made an attempt to solve the difficult problem of identifying word 
translations on the basis of more or less unrelated monolingual corpora of different 
languages. We applied the same algorithm to six language pairs and - using a rather 
conservative automatic evaluation measure that is based on 1,079 test words - we 
achieved accuracies in a range between 21% and 48%. We showed that the algorithm 
can be extended towards an interlingua approach that makes it possible to construct 
a dictionary for a particular language pair via several interlinguae, thereby opening 
up the possibility of improving the results through mutual cross-validation. What 
we suggest for future work is to perform a complete cross-validation that ranks 
each dictionary entry according to the number of successful cross-validations. If 
applicable, the work of a human end validator can be facilitated by providing them
with a ranked list of the translations of a word, ordered according to these ratings. In
addition, the amount of data to be considered by the validator can be significantly
reduced by introducing a threshold, i.e., by eliminating translations that do not reach
a certain level.
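
A minimal sketch of such a ranking, assuming that each interlingua yields a list of candidate translations for a given source word: candidates are ranked by the number of interlinguae that propose them, and those below a threshold are dropped. The function name `rank_by_agreement` and the toy candidate lists are hypothetical, and simple vote counting is only one possible way to realize the proposed cross-validation.

```python
from collections import Counter


def rank_by_agreement(candidates_per_interlingua, threshold=1):
    """Rank candidate translations by how many interlinguae propose them."""
    votes = Counter()
    for translations in candidates_per_interlingua.values():
        votes.update(set(translations))  # one vote per interlingua
    # Sort by descending vote count, then alphabetically for a stable order.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [(word, n) for word, n in ranked if n >= threshold]


# Hypothetical candidates for one German word translated into French
# via three different interlinguae.
candidates = {
    "en": ["maison"],
    "es": ["maison", "foyer"],
    "it": ["domicile"],
}

full_list = rank_by_agreement(candidates)
short_list = rank_by_agreement(candidates, threshold=2)
print(full_list, short_list)
```

Raising the threshold shortens the list handed to the validator, at the risk of discarding correct translations that only one interlingua supports.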

To make this feasible, we need large monolingual corpora (if possible from the 
same genre) for as many languages as possible. Well suited for this purpose would 
be, for example, the Gigaword Corpora from the Linguistic Data Consortium, which 
are billion-word newswire texts that are available (though at substantial cost) for
Arabic, Chinese, English, French, and Spanish. 

Other possibilities for improvement include pre-processing of the corpora and 
bootstrapping of the base lexicon. Pre-processing depends on the tools that are 
available for the respective languages. For example, a lemmatizer can convert inflec- 
tional variants to their respective base forms, which should significantly reduce 
the problem of data sparseness. In addition, with a program for word sense disambiguation,
different senses of a word can be distinguished, and the appropriate
translations can be determined for each sense of a word. Alternatively, if no
disambiguator is available, one can consider co-occurrences between
sequences of words instead of co-occurrences between single words. The rationale
behind this is that neighboring words often disambiguate each other, so the word 
sequences are likely to carry less ambiguity than the words. 

By bootstrapping of the base lexicon we mean that the algorithm starts from a
very small base lexicon, which can then be expanded iteratively. To improve oper- 
ation, those source language words whose associations are covered by the base 
lexicon should be identified systematically, so that their translations can be deter- 
mined first. For such words the likelihood of arriving at a correct translation ought 
to be highest. Once their translations are known, they are added to the base lex- 
icon, and the process is repeated. Every few iterations the existing entries of the 
base lexicon can be recomputed and revised in order to obtain improved accuracy
(which gains from the increase in lexicon coverage). Assuming large corpora of 
good quality, it is well possible that this process converges at accuracy levels that 
are significantly better than what we were able to present here. 
Multilingual Knowledge-Based Concept
Recognition in Textual Data 



Martin Schierle and Daniel Trabold 



Abstract Given the increasing volume of textual data available
through digital resources today, the identification of the main concepts in those texts
becomes increasingly important and can be seen as a vital step in the analysis of
unstructured information.

Research in this area has focused on the detection of named entities like per- 
son names or organization names, which only cover a very small part of concepts 
in texts. Especially the unique mapping between concepts in different languages 
requires parallel corpora, which are rarely available in industrial settings. 

We therefore propose a powerful new knowledge based model to recognize var- 
ious kinds of concepts even in very short and specialized texts using linguistic 
information for synonym handling and word sense disambiguation. 

We evaluate the proposed model on texts from the automotive domain. 

Keywords Entity recognition · Multilingual concept recognition · Text mining



1 Introduction 

The identification of basic domain concepts in natural language is an important step 
for any deep analysis of textual data. Without such a step any further interpretation 
gets hampered by synonyms and ambiguities. In contrast to the extensive scientific 
work on the extraction of (Named) Entities, the detection of concepts in real appli- 
cations is not restricted to the extraction of person names, organization names or 
locations like countries or cities. There are some main differences: 

1. Domain specific concepts are not limited to nouns or noun phrases. In principle,
a user might be interested in nearly any kind of information or fact, for example
properties such as fast (expressed as adjectives or adverbs) or behaviors such
as is working (expressed as verbal phrases or verbs). The high specialization of
approaches for Entity Extraction limits their usefulness for concept recognition
in real world applications.

2. Large companies deal with texts in several or even dozens of languages, which 
requires a concept recognition system to uniquely identify concepts across lan- 
guage borders. This is very difficult, because there is seldom a single direct 
translation between concepts across languages. A word from a source language 
can usually be translated into a whole bunch of words in the target language, 
of which most tend to have slightly different meanings. To identify the cor- 
rect meaning of a word with possibly different meanings we usually rely on the 
context the word appears in. This implies that any system for concept recogni- 
tion must take the context into account while searching for concepts in the text. 
The identification of the same concept in another language may require another 
contextual clue. This problem is not specifically addressed by the researchers 
focusing on multilingual Named Entity Extraction. 

3. In contrast to the vocabulary used in newspaper articles or the world wide web, 
the vocabulary of a certain domain is much more specific and limited. Taking into 
account that for specialized domains extensive dictionaries and resources exist 
already, it is possible to reuse this information for a knowledge based concept 
recognition. 

We will present a knowledge based approach to multilingual hierarchical concept
recognition and an appropriate data structure that handles synonyms, maps concepts
between languages, and disambiguates concepts using part-of-speech tags and the
context the word appears in.



2 Application in the Automotive Domain 

The concept recognition approach presented in this paper is designed for the analysis 
of automotive repair order texts written by mechanics. For each repair performed, 
the mechanic writes down a short note on the problem experienced by the cus- 
tomer, its cause and of course the work performed to remedy the situation. These 
records serve as a valuable feedback to the car manufacturer as they contain a textual 
description of arbitrary detail. This data is used for quality analysis, data mining and 
early warning tasks. One main goal in using text for quality analysis is to identify 
trends in the distribution of failures based on components and symptoms mentioned 
in the texts. Therefore the correct recognition of component names, failure symptoms
and corrective actions is crucial for the task of quality analysis. Based on the
component names found, trend charts may be obtained by comparing the data to
previous analyses. However, the vast number of components and symptoms makes it
hard to focus on all the information extracted at once. A good way to handle the data
is to use a hierarchical concept structure to start the presentation at the top level
and let the user drill down the hierarchy whenever further details are needed.

Technical dictionaries alone are by far not sufficient for the recognition
of concepts, as the system is confronted with synonyms, ambiguities, abbreviations and
previously unknown component names. In particular, the detection of failure symptoms,
which are often expressed as adjectives, verbs or complete phrases, and their
mapping across language borders require a very sophisticated approach.



3 Related Work 

Related work is mainly found in the area of named entity recognition (NER).
Early rule based approaches to NER are described in Hobbs, Appelt, Tyson, Bear,
and Israel (1992) and Grishman (1995). Other popular techniques include
Decision Trees (Sekine, Grishman, & Shinnou, 1998) and Hidden Markov
Models (Miller, Crystal, Fox, Ramshaw, Schwartz, et al., 1998). For domain-specific
areas it could be shown that knowledge based approaches can combine good
results with reasonable effort and fast runtime (Hanisch, Fundel, Mevissen, Zimmer,
& Fluck, 2005 or Cohen, 2005). These systems also integrate methods for Word
Sense Disambiguation (WSD), but none of them uses a sophisticated knowledge
base like WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990) or
EuroWordNet (Vossen & Letteren, 1997). However, WordNet only contains English
concepts and EuroWordNet lacks a mechanism to resolve ambiguities if a synset in
one language is translated to different synsets in another language.

Some additional difficulties in multilingual NER are described in Huang (2005).
Saito and Nagata (2003) used HMMs to recognize named entities in different languages.
However, they do not establish a direct mapping between the entities found.
Therefore their approach may not be used to analyze multilingual sources. Most
approaches that can establish a mapping between named entities across language
borders require bilingual corpora (cf. Huang (2005) and Klementiev and Roth (2006)).

We outline a way to combine these different ideas to uniquely identify corresponding
concepts in different languages using a sophisticated knowledge base. Our
approach identifies concepts, handles synonyms and disambiguates concepts using
context keywords and part-of-speech tags.



4 Requirements 

This section will describe some of the basic requirements we were challenged 
with. Although we encountered them within our domain specific tasks, they rep- 
resent common concept recognition requirements. Our knowledge-based approach 
to master these requirements will be presented afterwards. 

1. The handling of synonyms can be seen as a basic, but crucial task. One concept 
may be expressed in different ways, but it is necessary to identify all these dif- 
ferent expressions as the same concept. With respect to the automotive domain, 
the words car and vehicle should be treated as synonyms as well as for example 
wiring and cabling. 
Table 1 Domain examples for word sense disambiguation

Text example                                                POS    Context    Concept to identify
Warranty doesn’t cover shop supplies                        Verb   Warranty   -
Removed valve body and replaced cover                       Noun   Valve      Valve cover
Customer states center seat feels weak and cover is loose   Noun   Seat       Seat cover

2. Identical words may refer to different concepts (ambiguities like homonyms), 
which makes it necessary to disambiguate the word sense. Homonyms may exist 
within the same part-of-speech tag as well as with different tags. Some examples 
are given in Table 1.

3. The unique recognition of concepts across language borders is one of the most 
important requirements, especially with regard to our specific goal of quality 
analysis. It is important to emphasize here, that a unique concept mapping may be 
considerably different from the task of machine translation due to the focus on the 
meaning of a word. We are not searching for translations, but for the identification 
of equivalent meanings. The English word memory seat for example refers to a 
seat with the capability to restore a previously memorized position. The German 
language has no equivalent term. In order to express the same idea one may say 
Sitz mit Memory-Funktion, which is a description rather than a translation. 

4. Finally humans tend to use different levels of abstraction to describe things. They
use more general or more specific terms (like noise and squeak), or talk about
a specific part (brake pad) or the whole system (brake system). Therefore our
approach must be able to handle hyponymy as well as meronymy, which is especially
important for the analysis of car components. It would be of little use if an
analyst who is interested in brake related issues or noise problems needed to
define each time which concepts should be considered during the analysis.

Table 1 gives some examples of domain texts that illustrate the different
usages of the word cover. Note that the meaning of the word depends on the word’s
part of speech (pos) and context; the resulting concept is given in the last column.



5 Data Structure 

WordNet (Miller et al., 1990) uses synsets for synonym handling. A synset is a set
of one or many words that are interchangeable in some context without changing the
meaning. In our domain the terms courtesy lamp, dome light and interior light are
synonyms which will be stored in one synset. Vossen and Letteren (1997) proposed
to construct separate WordNets for each language and to map the synsets with an
interlingual index. This index is basically a list of mappings. However, when a
word has several different meanings which translate differently, it is not apparent
which translation to choose. We therefore expand the synsets in two ways. First we
include the information necessary to disambiguate word meanings. Most important
for word sense disambiguation are the part of speech (pos) of a word and the context C
it appears in (see Table 1 for some examples). Second we tightly couple synsets of
different languages. The resulting structure represents a general concept. Figure 1
shows our structure of a concept.

ID 78
German    Ventildeckel (NN)
          Deckel (NN), context: Ventil
          Zylinderkopfdeckel (NN)
English   Valve (NN) Cover (NN)
          Cover (NN), context: valve
          Rock (NN) Cover (NN)

Fig. 1 Example for a multi-lingual concept of the polyhierarchical taxonomy

Each concept is identified by a unique id and has entries for one or several
languages. For each language L we store one or more synonymous expressions. Each
expression consists of 1 to n tokens T_ij with pos tags PoS_ij and one context C_i.
This basic structure enables word sense disambiguation as well as synonym handling
and multilingual concept mapping. Figure 1 shows a concept which is called
Ventildeckel in German and valve cover in English. In both cases the words need to
be nouns, abbreviated as NN in the figure. The second line for English covers the
ambiguous word cover, which can be identified as a valve cover if the word valve
appears in the context of the word cover.
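
The concept structure described above might be represented as follows. This is a minimal sketch, not the authors' implementation: the class and field names (`Expression`, `Concept`, `concept_id`, `expressions`) and the language codes are hypothetical, and the example entries follow the valve cover concept from Fig. 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Expression:
    """One synonymous expression: tokens, their pos tags, optional context."""
    tokens: Tuple[str, ...]
    pos_tags: Tuple[str, ...]
    context: Optional[str] = None  # keyword that must occur nearby, if any


@dataclass
class Concept:
    """A language-independent concept with expressions per language."""
    concept_id: int
    expressions: Dict[str, List[Expression]] = field(default_factory=dict)


valve_cover = Concept(
    concept_id=78,
    expressions={
        "de": [
            Expression(("Ventildeckel",), ("NN",)),
            Expression(("Deckel",), ("NN",), context="Ventil"),
        ],
        "en": [
            Expression(("valve", "cover"), ("NN", "NN")),
            # The ambiguous single word needs "valve" in its context.
            Expression(("cover",), ("NN",), context="valve"),
        ],
    },
)
print(valve_cover.expressions["en"][1])
```

Keeping all languages under one concept id is what allows a match in any language to be counted as the same concept.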

So far the structure enables synonym and ambiguity handling and is also able
to store translations for concepts to other languages. Synonyms are grouped in one
concept node while the different meanings of a homonym may be represented using
different part-of-speech tags, different context definitions or even both. However,
the structure is not yet able to handle hyponymy (sub-term relation) and meronymy
(part-of relation). We do not distinguish the two semantic relations, assuming that
the differences with respect to our application can be neglected. The data structure
addresses these issues by ordering the concepts in a polyhierarchical taxonomy
(Fig. 2). A concept is a child of another concept if it is a hyponym or a meronym
of the other concept. The polyhierarchy is needed to model the fact that side window
is a hyponym of the more general term window and at the same time a meronym of
the side door. Besides the hyponym and meronym handling the hierarchy has some
other advantages. Some concepts do not easily translate into other languages. In
those cases it is possible to create more specific nodes for a single language and
model them as subnodes of the next general node. While some terms may be so language
dependent that there might be no equivalent term in other languages, we can
still do language independent analysis on a higher level of the hierarchy.

Fig. 2 Exemplary part of the concept hierarchy for noise modeling in the automotive domain

Therefore the hierarchy can be divided into a language dependent part (mainly
leaves) and a language independent part (Fig. 2). Words like squeak are very
problem-specific and carry a special meaning for a mechanic, which cannot be easily
translated. High noises on the other hand can be expressed, understood and
interpreted in all languages.

The other advantage of the hierarchical layout is that a system using such
a structure is able to analyze very precise texts as well as more general ones.
For a general survey the frequencies of the topmost terms give a good starting point,
while for detailed investigation the term frequencies at the bottom of the structure
are the interesting ones.
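
How frequencies at different levels of a polyhierarchy relate can be illustrated with a small roll-up: each concept's count is propagated to all of its ancestors, where a concept may have more than one parent. The function names and the example hierarchy are hypothetical (the side window relations follow the text, the noise branch is an assumed fragment), and the counts are invented for illustration.

```python
from collections import Counter


def ancestors(concept, parents):
    """Collect all ancestors of a concept in a hierarchy with multiple parents."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def roll_up(direct_counts, parents):
    """Add each concept's own count to the concept and to all its ancestors."""
    totals = Counter()
    for concept, n in direct_counts.items():
        totals[concept] += n
        for ancestor in ancestors(concept, parents):
            totals[ancestor] += n
    return totals


# Child -> parents; "side window" has two parents (hyponym and meronym).
parents = {
    "side window": ["window", "side door"],
    "squeak": ["high noise"],
    "high noise": ["noise"],
}
direct_counts = {"side window": 3, "squeak": 5, "high noise": 2}
totals = roll_up(direct_counts, parents)
print(totals["noise"], totals["window"], totals["side door"])
```

An analyst interested in noise problems in general would read the count at the noise node, while the squeak node gives the language-specific detail.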



6 Concept Recognition 

The concept recognition is based on the data structure described in Sect. 5 and
preprocessed input data. The preprocessing is done using a workflow based on IBM’s
UIMA framework (Ferrucci & Lally, 2004). Its basic steps are language identification,
tokenization and spelling correction (see Schierle, Schulz, & Ackermann, 2007
for further details). In addition to Schierle et al. we incorporated a part-of-speech
tagger into the workflow.
6.1 Taxonomy Expansion

Before the matching takes place, a taxonomy expansion step is applied, which is
further divided into synonym expansion and context expansion. For the synonym
expansion of a term t_i, the taxonomy is searched for a term t_j (with synonyms)
such that a token-delimited substring s ⊆ t_i is one of the synonyms of t_j. Then
we expand the synset of t_i by creating new synonyms, replacing the substring s
in t_i by all synonyms of t_j. If there are for example the two synsets S_1 = (light,
lamp) and S_2 = (fog light), we can expand S_2 to (fog light, fog lamp). The
expansion may run infinitely if no precautions are taken to avoid loops during the
replacement. Special attention must be paid to the context while substituting s with
the synonyms of s defined in t_j. The substitution may only be done if both t_i and t_j
require no or the same context.

It is important to be aware that this step may lead to synonyms which
are not common, but it is very unlikely to create synonyms which are erroneous.
Similar to the synonym expansion we perform a context expansion step by adding
all alternative contexts that can be obtained by replacing each word in the context
with its synonyms defined elsewhere in the taxonomy. The additional synonyms
created during the expansion are not stored in the taxonomy; they are only derived
at runtime. These two expansion steps significantly reduce the amount of information
that needs to be entered manually by the expert.
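
The synonym expansion can be sketched on the fog light example. The sketch makes several simplifying assumptions that are not taken from the text: terms are tuples of tokens, context constraints are ignored, and loops are avoided simply by making a single pass over the original synonyms. The function name `expand_synset` is hypothetical.

```python
def expand_synset(synset, others):
    """Expand a synset by substituting token sub-sequences with their synonyms.

    synset: list of terms, each a tuple of tokens.
    others: list of other synsets in the same format.
    """
    expanded = list(synset)
    for term in synset:  # single pass over the original terms avoids loops
        for other in others:
            for synonym in other:
                k = len(synonym)
                for start in range(len(term) - k + 1):
                    if term[start:start + k] != synonym:
                        continue
                    # Replace the matched tokens by every synonym of the set.
                    for replacement in other:
                        candidate = term[:start] + replacement + term[start + k:]
                        if candidate not in expanded:
                            expanded.append(candidate)
    return expanded


s1 = [("light",), ("lamp",)]
s2 = [("fog", "light")]
s2_expanded = expand_synset(s2, [s1])
print(s2_expanded)
```

Working on token tuples rather than raw strings keeps the substitution token-delimited, so light inside a longer word such as lightning would not be replaced.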



6.2 Matching Process 

For efficient processing a separate trie is built for every language defined in the
taxonomy, containing all synonyms of all synsets of the language, with one token per
node. The separate construction of a trie for each language has the further advantage
that only the information for the languages currently needed has to be loaded into
memory. This trie can be efficiently used to find the matching concepts in the input
text. If a sequence of 1 to n tokens t_1 ... t_n matches a concept, the sequence of the
part-of-speech annotations associated with t_1 ... t_n is compared with the sequence
of pos-tags for all tokens of the concept. They need to be identical. Besides the
pos constraint, the matcher checks the context on either side. If several concepts
match a given input but they all require a different context, the matcher will choose
the concept for which the distance between the sequence of matching tokens and
the required context is minimal. If no such context can be found within a distance
limited by some threshold δ, the matcher will choose the concept that requires no
context, or will return no match for the input sequence if such a concept does not
exist. The threshold δ is mainly needed to avoid the association of some word with a
context that has nothing to do with it. We encountered problems during the analysis
of texts that had no sentence boundaries marked. The threshold may reduce the
recall but ensures that even if no sentence boundaries are available the precision
remains high.
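
The context check with a distance threshold might look as follows. This is a sketch under assumptions: the function name `choose_concept`, the representation of candidates as pairs of concept name and required context keyword, and measuring distance in tokens are all choices made here; the pos-tag comparison is omitted. The example sentence is taken from Table 1.

```python
def choose_concept(tokens, start, end, candidates, max_distance):
    """Pick the candidate whose required context word is closest to the match.

    tokens[start:end] is the matched span; candidates is a list of
    (concept, required_context) pairs, where the context may be None.
    """
    best = None      # (distance, concept) of the closest context found so far
    fallback = None  # first candidate that requires no context
    for concept, context in candidates:
        if context is None:
            if fallback is None:
                fallback = concept
            continue
        for i, token in enumerate(tokens):
            if token != context or start <= i < end:
                continue
            # Token distance between the context word and the matched span.
            distance = start - i if i < start else i - end + 1
            if distance <= max_distance and (best is None or distance < best[0]):
                best = (distance, concept)
    return best[1] if best else fallback


tokens = "removed valve body and replaced cover".split()
candidates = [("valve cover", "valve"), ("seat cover", "seat")]

wide = choose_concept(tokens, 5, 6, candidates, max_distance=5)
narrow = choose_concept(tokens, 5, 6, candidates, max_distance=3)
print(wide, narrow)
```

With the wider limit the word cover is resolved to valve cover; with the narrower limit no context is close enough and, since no context-free candidate exists, nothing is returned, which illustrates the trade-off between recall and precision mentioned above.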
Note that there may be several potential matches for a given input sequence.
This represents a very common situation. Potential strategies for the list of matches 
include an enumeration of all matches or the longest match. We opted for the latter. 
Potential matches for the input sequence seat cover may for example include seat, 
cover and seat cover, of which we would only regard seat cover as an appropriate 
match. 
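
The longest-match strategy over a token trie can be sketched as follows. The names `build_trie` and `longest_matches` and the use of nested dictionaries with a sentinel key are assumptions made for this illustration; pos and context checks are left out to keep the example short.

```python
END = object()  # sentinel key marking the end of an expression in the trie


def build_trie(expressions):
    """Build a token trie from (tokens, concept) pairs, one token per node."""
    root = {}
    for tokens, concept in expressions:
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[END] = concept
    return root


def longest_matches(tokens, trie):
    """Scan left to right and keep only the longest match at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node, best = trie, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j])
            if node is None:
                break
            if END in node:
                best = (i, j + 1, node[END])  # remember the longest so far
        if best:
            matches.append(best)
            i = best[1]  # continue after the matched span
        else:
            i += 1
    return matches


trie = build_trie([
    (("seat",), "seat"),
    (("cover",), "cover"),
    (("seat", "cover"), "seat cover"),
])
matches = longest_matches("replaced seat cover".split(), trie)
print(matches)
```

Here only seat cover is reported for the input, while the shorter candidates seat and cover are discarded.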



7 Evaluation 

The knowledge base used for the experimental evaluation was constructed using
automated imports from company sources (technical dictionaries, component hierarchies)
and a reasonable amount of manual work (approx. 4 weeks), and contains around
2,000 concepts. Although most of the concepts are specified in several languages
(at least German and English), only the English definitions have been thoroughly 
tested. To reduce the evaluation effort we evaluate the system on English data only. 
As the multilingual capabilities are accomplished by the knowledge-base structure 
itself and are just as good as the information maintained in it, we assume that an 
English evaluation on Entity Recognition and Word Sense Disambiguation is rep- 
resentative. We used the recall and precision measures as defined in Manning and 
Schuetze (1999), which are commonly used for text mining tasks. 

To examine the word sense disambiguation performance of the system, we man- 
ually reviewed 100 concepts which were disambiguated using pos-tags, and another 
100 concepts which were disambiguated using context keywords. The pos-tags 
were obtained by a Hidden-Markov-Model based pos-tagger, which was trained 
on approx. 26,000 manually annotated words from our domain, and yields an 
accuracy of 92.2%. Both approaches achieved good results: the pos-tag based 
disambiguation showed an accuracy of 91%, the context based evaluation an accuracy 
of 92%. For comparison, we also evaluated a simple baseline method that always 
chooses the most frequent meaning. The baseline showed an accuracy of 82% for the 
pos-tag based ambiguities, and an accuracy of 52% for the context-based ambiguities, 
which is significantly worse. 

Figure 3 shows an exemplary evaluation for the word-sense-disambiguation 
using the context. We tried to disambiguate the word band by means of the contexts 
of its different meanings am band, fm band and radio band for the more general 
concept. The figure shows that the best results in terms of the F-measure are achieved 
with a context size of approx. five words, which slightly differs from other concepts. 
At first glance it may be surprising that a larger context can actually decrease 
the performance. Intuitively, a larger context should provide more information and 
should therefore improve the results. We found that missing sentence boundaries in 
the repair text are responsible for the performance degradation. 

To evaluate the system’s overall Entity Recognition performance, we applied it to 
100 English repair order documents, which were manually reviewed. The experimental 
results show a very high quality of the recognized concepts with a precision 
of 97.9%. The system achieves a recall of 81.6%, which indicates that a high portion of 
concepts that are meaningful for our analysis is found. The recall is directly related 
to the filling level of the knowledge base, and can therefore easily be increased. 

Fig. 3 Influence of the context size (in words) for word-sense-disambiguation 
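For reference, the measures used in this section can be computed as in the following sketch. The function names are illustrative, and the balanced F-measure (beta = 1) is an assumption, since the text does not state which weighting was used.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true positives, false positives and
    false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 for beta = 1)."""
    denom = beta ** 2 * precision + recall
    if denom == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Precision and recall reported for the Entity Recognition evaluation.
f1 = f_measure(0.979, 0.816)
print(f"F1 = {f1:.3f}")
```

Under the balanced-weighting assumption, the reported precision and recall correspond to an F-measure of roughly 0.89.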



8 Conclusion and Future Work 

We presented in this paper a knowledge based approach to multilingual and hierar- 
chical concept recognition, including appropriate data structures and algorithms to 
handle synonyms as well as ambiguities and hyponyms. 

We conclude that especially in very specific domains a knowledge based approach 
can be realized with justifiable human effort and high quality results. The system is 
currently used to process English and German documents for quality analysis in the 
automotive domain with satisfying results and performance. Our future work will 
focus on the integration of additional background knowledge into the taxonomy, 
like attributes and relations, and the integration of the taxonomy into quality analy- 
sis processes. This will involve further research on how to include the taxonomy in 
association rules, decision trees and early warning algorithms. 
A Diversified Investment Strategy Using 
Autonomous Agents 



Rui Pedro Barbosa and Orlando Belo 



Abstract In a previously published article, we presented an architecture for imple- 
menting agents with the ability to trade autonomously in the Forex market. At the 
core of this architecture is an ensemble of classification and regression models 
that is used to predict the direction of the price of a currency pair. In this paper, 
we will describe a diversified investment strategy consisting of five agents which 
were implemented using that architecture. By simulating trades with 18 months of 
out-of-sample data, we will demonstrate that data mining models can produce prof- 
itable predictions, and that the trading risk can be diminished through investment 
diversification. 

Keywords Intelligent Agents · Forex Trading. 



1 Introduction 

Due to the growing interest in quantitative and algorithmic trading, there has been 
an increasing number of studies regarding the use of classification and regression 
models in the prediction of financial time series. Different approaches have been 
considered, from categorization of press releases using support vector machines 
for stock trading (Mittermayer, 2004), to using artificial neural networks to pre- 
dict exchange rates (Yao & Tan, 2000). In a previous study (Barbosa & Belo, 2008) 
we presented our own approach to this problem, with the description of an archi- 
tecture that can be used to implement autonomous Forex trading agents. The most 
important part of this architecture is an ensemble of classifiers and regression mod- 
els, which continuously tries to predict the direction of the price of a currency pair. 
In this paper, we will describe the implementation of five different agents using this 
architecture. Each agent will be responsible for trading a different currency pair, 
with a 6-h timeframe. The traded pairs will be the EUR/USD, USD/JPY, EUR/JPY, 
USD/CHF and EUR/CHF. We intend to use the agents’ trading performance to 
demonstrate two things. First, that classifiers and regression models can produce 
profitable predictions, thus providing further empirical evidence that some data min- 
ing techniques can be useful in the development of successful trading strategies. 
Second, that we can lower the trading risk through diversification, by making the 
agents share the monetary resources. 



2 Agents’ Implementation 

The architecture for implementing autonomous trading agents is composed of three 
modules (Barbosa & Belo, 2008), as shown in Fig. 1. The first module, named Intu- 
ition Module, is responsible for predicting the direction of the price of a currency 
pair. At the core of this module is an ensemble of classification and regression 
models, which tries to predict if the price of a currency pair will increase or decrease 
in the near future. The second module, named A Posteriori Knowledge Module, is 
responsible for handling money management. It consists of a case-based reasoning 
system, which uses information collected from previous trades to decide how much 
to invest in the next trade. The last module is called A Priori Knowledge Module, and 
is implemented with a rule-based expert system. Rules with domain specific knowl- 
edge can be inserted in this system, in order to optimize the agents’ profitability. This 
module is responsible for managing risk and making the final trading decisions. 

Each module makes a different contribution to the agents’ performance. In 
order to demonstrate this difference, we will describe the agents’ implementation 
module by module, and compare the simulated trading results for different module 
combinations using 18 months of out-of-sample data. 

Fig. 1 Architecture for implementing trading agents 



2.1 Prediction Mechanism 

The Intuition Module is an essential part of the agents’ trading strategy. This module 
is responsible for predicting the direction of the price of a currency pair in the near 
future, so that a trade can be opened accordingly. The predictions are made periodically 
by an ensemble of classification and regression models. Given that the agents 
will trade with a 6-h timeframe, their models need to make four predictions a day: at 
midnight GMT they predict where the price will be at 6 a.m., at 6 a.m. they predict 
where the price will be at 12 noon, and so forth. While the classification models 
try to predict the next price class (“the price will increase in the next six hours” or 
“the price will decrease in the next six hours”), the regression models try to predict 
the price return over the next six hours. This return is then converted to one of the 
classes, so that the ensemble prediction can be calculated. 

In order to ensure the agents’ autonomy, the Intuition Module performs two tasks 
before each prediction: 

• Retrains all the models with new data; however, a retrained model only replaces 
its older version in the ensemble if simulation results show that it would have 
been at least as profitable as the older version in the last 100 trades. 

• Sets the weight of the vote of each model according to its profitability in the most 
recent 100 trades. 

Once the Intuition Module finishes updating the models and their votes’ weights, 
the prediction for the next six hours period can be performed. Each of the models in 
the ensemble makes its prediction, and these predictions are aggregated by adding 
the weights of the votes of all the models that predict “the price will increase in the 
next six hours” and then subtracting the weights of the votes of all the models that 
predict “the price will decrease in the next six hours”. If the result is greater than 
zero, the ensemble prediction is that the price will increase, otherwise if it is lower 
than zero, the final prediction is that the price will decrease in the next six hours. A 
trade can then be opened accordingly: the agent buys the currency pair if it predicts 
its price will increase, or short sells it if it predicts a price drop. 
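The aggregation step described above can be sketched as a weighted vote. The names below are illustrative; converting a predicted return to a class by its sign, and opening no trade when the weighted sum is exactly zero, are assumptions, since the text does not specify either case. The weights are taken as given (in the text they follow each model's profitability over its most recent 100 trades).

```python
from typing import List, Optional


def to_class(predicted_return: float) -> str:
    """Map a regression output to a price class (assumption: by sign)."""
    return "up" if predicted_return > 0 else "down"


def ensemble_vote(predictions: List[str],
                  weights: List[float]) -> Optional[str]:
    """Add the weights of the 'up' votes, subtract those of the 'down'
    votes, and trade in the direction of the sign of the result."""
    score = sum(w if p == "up" else -w
                for p, w in zip(predictions, weights))
    if score > 0:
        return "buy"
    if score < 0:
        return "sell"
    return None  # assumption: no trade on an exact tie


# Two classifiers and one regression model (hypothetical outputs).
votes = ["up", "down", to_class(0.0012)]
print(ensemble_vote(votes, [1.0, 2.5, 1.0]))  # the heavier 'down' vote wins
```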

To implement each agent’s Intuition Module, we used the Weka API 
(http://www.cs.waikato.ac.nz/ml/weka/) to train several models with 4,000 instances, 
corresponding to the period from May 2003 till December 2006. Among the attributes 
used for training were: the hour (0, 6, 12 or 18) and day of the week the prediction 
was made; the current class (“the price increased in the last six hours” or “the price 
decreased in the last six hours”); the close price; the price return over the current 
6-h period; lagged price returns (LAG); price return moving averages (MA); the 
relative strength index (RSI); the Williams %R (WIL); and the rate of change (ROC). 
After training, each model was tested with out-of-sample data corresponding to the 
month of January 2007. In order to avoid redundant predictions, we used the test 
results to select, for each ensemble, the group of models with the most heterogeneous 
predictions, by discarding the models that consistently predicted the same 
price movements. Tables 1-5 describe the final makeup of the agents’ ensembles. 
In order to examine the predictive ability of these ensembles, we implemented the 
agents using only the Intuition Modules, as shown in Fig. 2. 

Table 1 Ensemble of the EUR/USD trading agent 

Model       Attributes                                                 Prediction
NaiveBayes  Hour, day, return                                          Next class
LibSVM      Hour, day (num), MA(2), RSI(11), ROC(12)                   Next return
SimpleCart  Hour, day, LAG(2), RSI(2), ROC(2), ROC(5)                  Next class
LibSVM      Hour, day, MA(6), MA(4), MA(3), return                     Next return
LeastMedSq  Hour, day, LAG(5), LAG(4), LAG(3), LAG(2), LAG(1), return  Next return
KStar       Hour, day, class                                           Next return
RBFNetwork  Return, hour (num), day, MA(12), ROC(4)                    Next class

Table 2 Ensemble of the EUR/JPY trading agent 

Model       Attributes                                       Prediction
IB1         Class, hour, day (num), MA(8), MA(12), RSI(15)   Next class
LibSVM      Class, hour (num), day, LAG(7), WIL(25), ROC(7)  Next return
SimpleCART  Hour (num), day (num), MA(8)                     Next class
PART        Class, hour, day (num), RSI(21), ROC(7)          Next class
KStar       Class, hour, day (num), WIL(11), ROC(7)          Next return
KStar       Class, hour, day (num), MA(8), RSI(12), RSI(20)  Next class
LibSVM      Hour (num), day, MA(7)                           Next return

Table 3 Ensemble of the EUR/CHF trading agent 

Model       Attributes                                                                    Prediction
JRip        Return, hour (num), day, WIL(24), RSI(27), RSI(39), SRSI(8)                   Next class
Logistic    Class, hour (num), day, RSI(18), RSI(28)                                      Next class
SimpleCART  Class, close price, hour (num), day (num), LAG(5), SRSI(7), ROC(20), ROC(21)  Next class
LibSVM      Return, RSI(9), RSI(25)                                                       Next return
KStar       Class, MA(11), LAG(1), WIL(8), ROC(1)                                         Next return
RBFNetwork  Class, hour (num), WIL(17), RSI(23), ROC(3)                                   Next return
J48         Class, hour (num), day, LAG(1), WIL(11), RSI(14)                              Next class

Table 4 Ensemble of the USD/JPY trading agent 

Model       Attributes                                    Prediction
KStar       Hour, day, MA(6), class                       Next class
J48         Hour, day, MA(6), class                       Next class
JRip        Hour, day, class                              Next class
NaiveBayes  Hour, day, return                             Next class
LMT         Hour, MA(6), class                            Next class
KStar       Hour, day, MA(6), class                       Next return
LibSVM      Hour (num), day (num), MA(10), MA(2), return  Next return

Table 5 Ensemble of the USD/CHF trading agent 

Model           Attributes                                                  Prediction
RBFNetwork      Hour (num), LAG(6), SWIL(6), SWIL(34)                       Next class
IBk             Hour (num), WIL(24), LAG(1), LAG(6)                         Next return
PaceRegression  Hour (num), day (num), MA(4), LAG(4)                        Next return
LibSVM          Hour, LAG(4), WIL(7), WIL(23)                               Next class
IB1             Close price, hour, day, LAG(6), ROC(35), WIL(31), SRSI(22)  Next class
Simple          Class, hour (num), WIL(14)                                  Next class
LibSVM          Hour, day (num), MA(11), RSI(2), RSI(29)                    Next return

Fig. 2 Agent implementation using the Intuition Module 

The five agents were tested with 18 months of out-of-sample data, corresponding 
to the period from February 2007 till July 2008. To facilitate the interpretation of 
the results, we will consider that each agent used a trade size corresponding to a pip 
value of $10. A pip is the smallest possible change in the price of a currency pair 
(0.01 for the Japanese yen pairs, 0.0001 for all the others). At current exchange rates, 
to achieve that value per pip, we would need a trading account with $148,000 to 
place a trade using the EUR/USD currency pair, $163,000 for the EUR/JPY and the 
EUR/CHF pairs, and $110,000 for the USD/JPY and the USD/CHF pairs. In order to 
verify if investment diversification can be used to implement a safer trading strategy, 
we also tested the scenario in which the agents share the monetary resources. That 
means each agent used a trade size corresponding to $2 per pip, which would require 
around $139,000 to accommodate the simultaneous opening of the five trades. 
Figure 3 and Table 6 show the results of using the modules’ predictions to simulate 
trades in the out-of-sample period. A high cost of five pips per trade was 
taken into consideration, thus the simulation results should be similar to the results 
that would have been obtained trading live. 

Fig. 3 Simulation results using the predictions of the Intuition Modules (series: EUR/JPY, USD/JPY, USD/CHF, EUR/CHF, EUR/USD and Diversified) 

At first sight, it seems our prediction mechanisms are useless. All the agents 
ended up losing money after 18 months of trading. However, if we break down the 
results, we can verify that the Intuition Modules worked as expected. They did out- 
put profitable predictions. The problem was that too many trades were opened, and 
a lot of money was wasted with commissions. For example, the EUR/CHF agent 
lost $56,865 after 1,647 trades. At a cost of five pips per trade, each pip being worth 
$10, that means it spent $82,350 in commissions. Therefore, if we do not take into 
account the trading costs, it actually won $25,485. It is obvious that something needs 
to be done regarding the agents’ excessive number of trades. 
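The arithmetic behind this breakdown can be checked with a few lines. The figures are the ones quoted above for the EUR/CHF agent; the function name is only illustrative.

```python
def trading_costs(trades: int, cost_pips: int, pip_value: int) -> int:
    """Total cost paid when every trade costs `cost_pips` pips."""
    return trades * cost_pips * pip_value


net_profit = -56_865                     # EUR/CHF agent, Intuition Module only
costs = trading_costs(1_647, 5, 10)      # 1,647 trades, 5 pips each, $10 per pip
gross_profit = net_profit + costs        # profit before trading costs

print(costs, gross_profit)
```

The same calculation can be repeated for the other agents with the trade counts and profits listed in Table 6.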



2.2 Money Management Using Empirical Knowledge 

The A Posteriori Knowledge Module was created to increase the agents’ profit with- 
out a proportional increase in the risk, which would occur if we simply incremented 
the trade size. This module uses a case-based reasoning system to predict how prof- 
itable a trade will be before it is opened, so that the amount to invest in the trade 
can be set accordingly. In order to accomplish this, the system stores the predictions 
of all the models in the ensemble and the corresponding results of all the trades the 
agent did in the past. When a new ensemble prediction is generated, a search is made 
for all the trades in the system with the same models’ predictions. The overall profit 
of the retrieved cases is then calculated. If it is considered high enough, the amount 
to invest in the trade is doubled, and if it is considered too low, the agent does not 
open the trade. We implemented the five agents using a combination of the Intuition 
Module and the A Posteriori Knowledge Module, as represented in Fig. 4. 
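A minimal sketch of this case-based sizing rule is given below. The class name, the two thresholds and the behaviour when no matching case exists (trade the standard size) are assumptions made for illustration; the text only states that the overall profit of the retrieved cases is compared against a "high enough" and a "too low" level.

```python
from collections import defaultdict
from typing import Sequence


class CaseBase:
    """Stores past trades keyed by the individual model predictions."""

    def __init__(self, high: float, low: float):
        self.high = high          # hypothetical threshold for doubling
        self.low = low            # hypothetical threshold for skipping
        self._profits = defaultdict(list)

    def record(self, predictions: Sequence[str], profit: float) -> None:
        """Remember the outcome of a trade made under these predictions."""
        self._profits[tuple(predictions)].append(profit)

    def size_multiplier(self, predictions: Sequence[str]) -> int:
        """2 = double the investment, 1 = standard size, 0 = skip."""
        cases = self._profits.get(tuple(predictions))
        if not cases:
            return 1              # assumption: no history, standard size
        total = sum(cases)
        if total >= self.high:
            return 2
        if total <= self.low:
            return 0
        return 1
```

Keying the cases on the full tuple of model predictions mirrors the retrieval step described above, in which only trades with the same models' predictions are considered.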
Table 6 Simulation results using different module combinations 

Pair         Modules                  Accuracy (%)  Success (%)  Trades  Max DD ($)  Profit ($)
EUR/USD      Intuition                53.14         53.14        1,671   44,165      -39,875
             Intuition + APosteriori  54.60         54.60        1,229   23,830      -13,610
             Intuition + APriori      53.46         58.49        1,015   16,445      -1,285
             Agent                    54.95         60.20        1,014   13,050      7,080
EUR/JPY      Intuition                54.90         54.90        1,683   29,375      -21,585
             Intuition + APosteriori  55.96         55.96        1,099   21,400      21,730
             Intuition + APriori      55.20         55.81        609     9,550       46,440
             Agent                    56.16         56.72        799     16,170      52,200
EUR/CHF      Intuition                53.67         53.67        1,647   56,915      -56,865
             Intuition + APosteriori  55.53         55.53        1,212   28,240      -26,475
             Intuition + APriori      53.42         58.52        954     9,820       -2,915
             Agent                    55.32         60.90        923     6,610       12,185
USD/JPY      Intuition                54.18         54.18        1,639   26,335      -25,900
             Intuition + APosteriori  56.35         56.35        1,173   11,380      28,050
             Intuition + APriori      54.06         54.19        710     6,630       27,785
             Agent                    56.36         56.36        874     10,130      44,945
USD/CHF      Intuition                53.46         53.46        1,678   34,320      -31,235
             Intuition + APosteriori  55.44         55.44        1,048   25,550      1,440
             Intuition + APriori      53.85         58.55        928     7,480       8,020
             Agent                    55.88         60.10        879     12,750      10,295
Diversified  Intuition                53.87         53.87        8,318   36,310      -35,092
             Intuition + APosteriori  55.56         55.56        5,761   10,646      2,227
             Intuition + APriori      53.89         57.40        4,216   3,770       15,609
             Agent                    55.70         58.96        4,489   4,034       25,341


The simulation results for these agents are shown in Fig. 5, and summarized in 
Table 6. There was a considerable improvement in the trading performances. This 
progress derives from the agents’ newfound ability to invest more capital when 
they expect a trade will be more profitable, and to skip trades they predict will be 
unprofitable. This allows them to trade less frequently, which means less money is 
spent on commissions and other trading costs. The results also show that diversifying 
the investment will, as expected, decrease the risk. This decrease can be seen in the 
lower maximum drawdown of the diversified strategy. Simply put, the maximum 
drawdown measures the maximum accumulated trading loss experienced in the past. 
It can be easily spotted in the accumulated profit charts by looking for the biggest 
“peak to valley” decline. The fact that the diversified strategy has the smoothest 
profit curve also corroborates the idea that we should use it to trade our capital, in 
lieu of using a single risk-prone agent. 

Fig. 4 Agent implementation using the Intuition and A Posteriori modules combination 

Fig. 5 Simulation results using the Intuition and A Posteriori modules combination 
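The maximum drawdown of an accumulated profit curve can be computed in a single pass, as in the sketch below. The function name and the sample curve are illustrative and are not taken from the simulations.

```python
from typing import Sequence


def max_drawdown(accumulated_profit: Sequence[float]) -> float:
    """Largest peak-to-valley decline of an accumulated profit curve."""
    if not accumulated_profit:
        return 0.0
    peak = accumulated_profit[0]
    worst = 0.0
    for value in accumulated_profit:
        peak = max(peak, value)            # highest level seen so far
        worst = max(worst, peak - value)   # decline from that level
    return worst


# Hypothetical profit curve: the biggest decline is from 800 down to 100.
curve = [0, 500, 200, 800, 100, 900]
print(max_drawdown(curve))
```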



2.3 Risk Management Using Domain Knowledge 

The A Priori Knowledge Module has two objectives: lowering the agents’ maximum 
drawdown and eliminating redundant trades. This module consists of a rule-based 
expert system, in which the following rules were inserted: 

• Do not trade if it is Christmas Day, New Year’s Day or Good Friday. 

• Only open a trade if another trade with the same direction and size is not already 
open; if that is the case, just leave the old trade open. 

• Close a trade if the price moves by a certain percentage in the predicted direction: 
0.20% for the EUR/USD, 0.60% for the EUR/JPY, 0.15% for the EUR/CHF, 
0.70% for the USD/JPY and 0.25% for the USD/CHF. These take-profit targets 
were specified according to each pair’s price volatility. 

We implemented the agents using the combination of the prediction module with 
the A Priori Knowledge Module, as shown in Fig. 6. The rule-based expert system 
was created with the JBoss Drools Engine (http://www.jboss.org/drools/). 

Fig. 6 Agent implementation using the Intuition and A Priori modules combination 

The results for these agents, shown in Fig. 7 and Table 6, demonstrate that the A 
Priori Knowledge Modules were able to decrease the agents’ total number of trades 
and maximum drawdown. There was also a substantial increase in the percentage 
of successful trades, i.e., trades that were profitable. The reason for this is simple. 
Previous agents’ implementations would only close a trade when the 6-h period 
ended, and a new trade was about to be opened. Therefore, a trade was successful 
only if the prediction of the direction of the currency price was accurate. The expert 
system’s take-profit rules, on the other hand, allow a trade to be closed with profit 
even if the prediction is incorrect, as long as the price moves by a certain percentage 
in the predicted direction during the 6-h period. 
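The take-profit and duplicate-trade rules can be sketched in plain Python as follows. This is not the Drools rule base used by the agents; the function names and the choice of measuring the price move relative to the entry price are assumptions made for illustration.

```python
# Take-profit targets in percent, as listed in the rules of this section.
TAKE_PROFIT_PCT = {
    "EUR/USD": 0.20,
    "EUR/JPY": 0.60,
    "EUR/CHF": 0.15,
    "USD/JPY": 0.70,
    "USD/CHF": 0.25,
}


def should_close(pair: str, side: str,
                 entry_price: float, current_price: float) -> bool:
    """True once the price has moved far enough in the traded direction."""
    move_pct = (current_price - entry_price) / entry_price * 100.0
    if side == "sell":
        move_pct = -move_pct      # a falling price favours a short trade
    return move_pct >= TAKE_PROFIT_PCT[pair]


def should_open(side: str, size: float, open_trades: set) -> bool:
    """Skip a new trade if one with the same direction and size is open."""
    return (side, size) not in open_trades
```

Because the target can be reached at any time within the 6-h period, a trade can close with a profit even when the price at the end of the period contradicts the prediction.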



2.4 Agents’ Results 

The actual autonomous trading agents were implemented using the architecture 
represented in Fig. 1, which is thoroughly described in another article (Barbosa & 
Belo, 2008). The simulation results, shown in Fig. 8 and Table 6, reveal that the 
agents were able to combine the A Priori Knowledge Module’s ability to decrease 
the drawdown with the A Posteriori Knowledge Module’s ability to increase the 
profit. All the agents were profitable during the simulation period, when trading 
separate accounts. The results of the diversified investment strategy were also positive. 
This strategy achieved an overall prediction accuracy of 55.70%, with 58.96% of the 
trades being successful. The total profit was $25,341, with a comparatively small 
maximum drawdown of $4,034. Had the agents started with an initial investment 
capital of $139,000, that would have meant a profit of around 18% after 18 months 
of trading. This is an acceptable performance, in particular because an expensive 
trading cost of five pips per trade was used in the calculations. This cost should be 
substantially lower in real life trading. 

Using the EUR/JPY agent to invest all our trading capital would have yielded 
the most profit during the simulation period. However, there are a couple of reasons 
why, for live trading, we should use the diversified investment strategy instead. First 
of all, the fact that it has the lowest maximum drawdown means it is relatively safer. 
But even more importantly, this strategy has the highest profit-to-drawdown ratio. 
In practical terms, this means we should be able to increase the profits substantially, 
without a significant increase in the risk, by making the agents use a reasonable 
amount of leverage, i.e., trade with borrowed funds. 

Fig. 8 Agents’ simulation results 
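The profit-to-drawdown comparison can be reproduced from the "Agent" rows of Table 6, as in the sketch below. The variable and function names are illustrative only.

```python
# (profit in $, maximum drawdown in $) from the "Agent" rows of Table 6.
AGENT_RESULTS = {
    "EUR/USD": (7_080, 13_050),
    "EUR/JPY": (52_200, 16_170),
    "EUR/CHF": (12_185, 6_610),
    "USD/JPY": (44_945, 10_130),
    "USD/CHF": (10_295, 12_750),
    "Diversified": (25_341, 4_034),
}


def profit_to_drawdown(profit: float, max_dd: float) -> float:
    """Profit earned per dollar of maximum drawdown."""
    return profit / max_dd


ratios = {name: profit_to_drawdown(p, dd)
          for name, (p, dd) in AGENT_RESULTS.items()}
best = max(ratios, key=ratios.get)

for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {ratio:5.2f}")
```

The diversified strategy comes out on top with a ratio of about 6.3, ahead of the USD/JPY and EUR/JPY agents, even though the EUR/JPY agent has the highest absolute profit.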



3 Final Remarks 

While our agent-based diversified investment strategy achieved interesting results, 
its performance could still use some improvement. One way to further optimize it 
would be to decrease the overall number of trades done by the agents, which is still 
too high. If we integrate the agents in a multiagent system, and make them negotiate 
among themselves before opening trades, it should be possible to eliminate a lot of 
redundant trades. For example, buying $1 of the EUR/USD currency pair and $1 
of USD/JPY pair yields the exact same result as simply buying $1 of the EUR/JPY 
pair. Therefore, in this particular situation, two trades could be replaced by just one. 
The implementation of a multiagent system that is able to eliminate this type of 
redundant trades will be the subject of our future work. 
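The redundancy in this example can be made explicit by netting the currency exposure of a set of trades. The sketch below is a simplification under stated assumptions: each trade is a hypothetical tuple (pair, side, notional amount), all amounts are expressed in the same unit, and trading costs are ignored.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def net_exposure(trades: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Net long (+) or short (-) exposure per currency.

    Buying a pair is long the base currency and short the quote currency;
    selling a pair is the reverse.
    """
    exposure = defaultdict(float)
    for pair, side, amount in trades:
        base, quote = pair.split("/")
        sign = 1.0 if side == "buy" else -1.0
        exposure[base] += sign * amount
        exposure[quote] -= sign * amount
    return {ccy: amt for ccy, amt in exposure.items() if amt != 0.0}


two_trades = net_exposure([("EUR/USD", "buy", 1.0), ("USD/JPY", "buy", 1.0)])
one_trade = net_exposure([("EUR/JPY", "buy", 1.0)])
print(two_trades == one_trade)
```

The USD legs of the two trades cancel, leaving the same long EUR, short JPY position as the single EUR/JPY trade, which is what a negotiating multiagent system could exploit.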
Classification with Kernel Mahalanobis 
Distance Classifiers 



Bernard Haasdonk and Elżbieta Pękalska 



Abstract Within the framework of kernel methods, linear data methods have 
almost completely been extended to their nonlinear counterparts. In this paper, 
we focus on nonlinear kernel techniques based on the Mahalanobis distance. Two 
approaches are distinguished here. The first one assumes an invertible covariance 
operator, while the second one uses a regularized covariance. We discuss concep- 
tual and experimental differences between these two techniques and investigate their 
use in classification scenarios. For this, we involve a recent kernel method, called 
Kernel Quadratic Discriminant and, in addition, linear and quadratic discriminants 
in the dissimilarity space built by the kernel Mahalanobis distances. Experiments 
demonstrate the applicability of the resulting classifiers. The theoretical consid- 
erations and experimental evidence suggest that the kernel Mahalanobis distance 
derived from the regularized covariance operator is favorable. 

Keywords Kernel methods · Mahalanobis distance · Quadratic discriminant. 



1 Introduction 

Nonlinear learning methods can be successfully designed by linear techniques in 
feature space induced by kernel functions. Many such kernel methods have been 
proposed so far, including the Support Vector Machine (SVM) and the Kernel Fisher 
Discriminant (KFD) (Mika, Rätsch, Weston, Schölkopf, & Müller, 1999). They have 
been widely applied to various learning scenarios thanks to their flexibility and 
good performance (Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004). 
In this paper, we consider a nonlinear kernel technique, the kernel Mahalanobis 
distance, which represents a kernel quadratic analysis tool. Two approaches to kernel 
Mahalanobis distance are distinguished and investigated here. The first one assumes 
invertible class covariance matrices in the kernel-induced feature space and is 
similar to the method discussed in Ruiz and Lopez-de Teruel (2001), while the other 
one regularizes them appropriately. As a result, these different assumptions lead 
to different formulations of kernel Mahalanobis classifiers. The goal of the current 
presentation is to compare these two approaches theoretically and experimentally. 
For the experiments we use different classifiers built on these kernel Mahalanobis 
distances. First, we use Kernel Quadratic Discriminant (KQD) analysis (Pękalska & 
Haasdonk, 2009). We also train classifiers in simple dissimilarity spaces (Pękalska 
& Duin, 2005) defined by the class-wise kernel Mahalanobis distances. In this way, 
we make an explicit use of the between-class information, which may also lead to 
favorable results. Our approach KQD is a pure kernelized algorithm and differs from 
the two-stage approach (Wang, Plataniotis, Lu, & Venetsanopoulos, 2008) which 
relies on supervised dimension reduction in a kernel-induced space followed by a 
quadratic discriminant analysis. 

The paper is organized as follows. Section 2 starts with preliminaries on kernels. 
Section 3 introduces the kernel Mahalanobis distances and subsequent classification 
strategies. Section 4 presents an experimental study on the kernel Mahalanobis dis- 
tance classifiers on toy and real world data. Section 5 gives some theoretical insights 
and we conclude with Sect. 6. 



2 Kernels and Feature-Space Embedding 

Let $\mathcal{X}$ be a set of objects, either a vector space or a general set of structured objects. 
Let $\phi: \mathcal{X} \to \mathcal{H}$ be a mapping of patterns from $\mathcal{X}$ to a high-dimensional or infinite 
dimensional Hilbert space $\mathcal{H}$ with the inner product $\langle \cdot, \cdot \rangle$. 

We address a $c$-class problem, given by the training data $X := \{x_i\}_{i=1}^{n} \subset \mathcal{X}$ 
with labels $\{y_i\}_{i=1}^{n} \subset \Omega$, where $\Omega := \{\omega_1, \ldots, \omega_c\}$ is a set of $c$ target classes. 
Let $\Phi := [\phi(x_1), \ldots, \phi(x_n)]$ be the sequence of images of the training data 
$X$ in $\mathcal{H}$. Given the embedded training data, the empirical mean is defined as 
$\phi_\mu := \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) = \frac{1}{n} \Phi 1_n$, where $1_n$ is an $n$-element vector of all ones. 

Here and in the following we will use such matrix-vector-product notation involving 
$\Phi$ for both finite and infinite dimensional $\mathcal{H}$, which is reasonable by suitable 
interpretation as linear combinations in $\mathcal{H}$. The mapped training data vectors are 
centered by subtracting their mean such that $\tilde{\phi}(x_i) := \phi(x_i) - \phi_\mu$, or, more 
compactly, $\tilde{\Phi} := [\tilde{\phi}(x_1), \ldots, \tilde{\phi}(x_n)] = \Phi - \phi_\mu 1_n^T = \Phi - \frac{1}{n} \Phi 1_n 1_n^T = \Phi H$. Here, 
$H := I_n - \frac{1}{n} 1_n 1_n^T$ is the $n \times n$ centering matrix, while $I_n$ is the $n \times n$ identity 
matrix. Note that $H = H^T = H^2$. The empirical covariance operator $C : \mathcal{H} \to \mathcal{H}$ 
acts on $\phi(x) \in \mathcal{H}$ as 

\[ C \phi(x) := \frac{1}{n} \sum_{i=1}^{n} (\phi(x_i) - \phi_\mu) \langle \phi(x_i) - \phi_\mu, \phi(x) \rangle = \frac{1}{n} \sum_{i=1}^{n} \tilde{\phi}(x_i) \tilde{\phi}(x_i)^T \phi(x) = \frac{1}{n} \tilde{\Phi} \tilde{\Phi}^T \phi(x). \] 

Here, we use the transpose notation $\phi(x)^T \phi(x') := \langle \phi(x), \phi(x') \rangle$ as an 
abbreviation for inner products, hence $\tilde{\Phi}^T \phi(x)$ denotes a column-vector of inner 
products. We can therefore interpret $\frac{1}{n} \tilde{\Phi} \tilde{\Phi}^T$ as an operator and identify the 
empirical covariance as $C = \frac{1}{n} \tilde{\Phi} \tilde{\Phi}^T$. 

The transformation $\phi$ acts as a (usually) nonlinear map to a high-dimensional 
space $\mathcal{H}$ in which the classification task can be handled in either a more efficient 
or more beneficial way. In practice, we will not necessarily know $\phi$, but choose 
a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that encodes the inner product in $\mathcal{H}$, instead. 
The kernel $k$ is a positive definite function such that $k(x, x') = \phi(x)^T \phi(x')$ for 
any $x, x' \in \mathcal{X}$. Particular instances of such kernels are the Gaussian Radial Basis 
Function $k_{\mathrm{rbf}}(x, x') := \exp(-\gamma \|x - x'\|^2)$ for $\gamma \in \mathbb{R}_+$ and the polynomial kernel 
$k_{\mathrm{pol}}(x, x') := (1 + \langle x, x' \rangle)^p$ for $p \in \mathbb{N}$. Given that $\mathcal{X} = \mathbb{R}^d$, the kernel $k_{\mathrm{rbf}}$ represents 
an inner product in an infinite dimensional Hilbert space $\mathcal{H}$, in contrast to a finite 
dimensional space for the polynomial kernel $k_{\mathrm{pol}}$. For details on kernel methods 
we refer to Schölkopf and Smola (2002) and Shawe-Taylor and Cristianini (2004). 
$K := \Phi^T \Phi$ is an $n \times n$ kernel matrix derived from the training data. Moreover, we 
will also use the centered kernel matrix $\tilde{K} := \tilde{\Phi}^T \tilde{\Phi} = H K H$. Further, 
for an arbitrary $x \in \mathcal{X}$, $k_x := [k(x_1, x), \ldots, k(x_n, x)]^T = \Phi^T \phi(x)$ denotes the 
vector of kernel values of $x$ to the training data, while $\tilde{k}_x := \tilde{\Phi}^T \tilde{\phi}(x) = H (k_x - \frac{1}{n} K 1_n)$ 
is the centered vector. Finally, we will also use the self-similarity $k_{xx} := k(x, x) = \phi(x)^T \phi(x)$ 
and its centered version $\tilde{k}_{xx} = \tilde{\phi}(x)^T \tilde{\phi}(x) = k_{xx} - \frac{2}{n} 1_n^T k_x + \frac{1}{n^2} 1_n^T K 1_n$. 

In addition to the quantities defined for the complete sequence $\Phi$, we can 
define analogous class-wise quantities which are indicated with the superscript $[j]$. 


3 Kernel Mahalanobis Distance Classifiers 

With the above notation, the Mahalanobis distance in the kernel-induced feature 
space $\mathcal{H}$ can be formulated purely in terms of kernel evaluations as we derive in the 
following. Then we introduce the subsequent classifiers. 



3.1 Kernel Mahalanobis Distances for Invertible Covariance 

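For simplicity of presentation, we consider here a single class of $n$ elements 
$\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. For classification, the resulting formulae will be used in a 
class-wise manner. We require here an invertible empirical class covariance operator $C$ in 
the kernel-induced space. This limits our reasoning to a finite-dimensional $\mathcal{H}$, as the 
image of $C$ based on $n$ samples has a finite dimension $m < n$. We want to kernelize 
the empirical square Mahalanobis distance 

\[ d^2(\phi(x); \{\phi_\mu, C\}) := (\phi(x) - \phi_\mu)^T C^{-1} (\phi(x) - \phi_\mu). \tag{1} \] 

Since $\mathcal{H}$ is $m$-dimensional, with $m < n$, we may interpret $\tilde{\Phi}$ as an $m \times n$ matrix. 
Hence, it has a singular value decomposition $\tilde{\Phi} = U S V^T$ with orthogonal matrices 
$U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ and a diagonal matrix $S \in \mathbb{R}^{m \times n}$. By using the orthogonality of 
$U$ and $V$, we have: $C = \frac{1}{n} \tilde{\Phi} \tilde{\Phi}^T = \frac{1}{n} U S S^T U^T$ and $\tilde{K} = \tilde{\Phi}^T \tilde{\Phi} = V S^T S V^T$, with an 
invertible matrix $S S^T \in \mathbb{R}^{m \times m}$, but singular $S^T S \in \mathbb{R}^{n \times n}$. So $C^{-1} = n U (S S^T)^{-1} U^T$ 
and $\tilde{K}^{-} = V (S^T S)^{-} V^T$, where the superscript $^{-}$ denotes the pseudo-inverse. 
Multiplication of these equations with $\tilde{\Phi}$ yields $C^{-1} \tilde{\Phi} = n U (S S^T)^{-1} S V^T$ and 
$\tilde{\Phi} \tilde{K}^{-} = U S (S^T S)^{-} V^T$. Since $S \in \mathbb{R}^{m \times n}$ is diagonal and has $m$ nonzero singular 
values, both middle matrices $(S S^T)^{-1} S$ and $S (S^T S)^{-}$ are $m \times n$ diagonal matrices 
with inverted singular values on the diagonal. Therefore, these matrices are identical 
and we conclude that 

\[ \frac{1}{n} C^{-1} \tilde{\Phi} = \tilde{\Phi} \tilde{K}^{-}. \tag{2} \] 

Given a centered vector $\tilde{\phi}(x) = \phi(x) - \phi_\mu$, $C$ acts on $\tilde{\phi}(x)$ as follows: 

\[ C \tilde{\phi}(x) = \frac{1}{n} \tilde{\Phi} \tilde{\Phi}^T \Big( \phi(x) - \frac{1}{n} \Phi 1_n \Big) = \frac{1}{n} \tilde{\Phi} H \Big( k_x - \frac{1}{n} K 1_n \Big) = \frac{1}{n} \tilde{\Phi} \tilde{k}_x. \tag{3} \] 

Since $C$ is invertible, this implies with (2) that $\tilde{\phi}(x) = \frac{1}{n} C^{-1} \tilde{\Phi} \tilde{k}_x = \tilde{\Phi} \tilde{K}^{-} \tilde{k}_x$. 
Together with the identity (2) this allows us to express the Mahalanobis distance for 
invertible covariance operator in its kernelized form as: 

\[ d^2_{IC}(x) := d^2(\phi(x); \{\phi_\mu, C\}) = \tilde{\phi}(x)^T C^{-1} \tilde{\phi}(x) = n \, \tilde{k}_x^T (\tilde{K}^{-})^2 \tilde{k}_x. \tag{4} \] 

In practice the computation of $\tilde{K}^{-}$ relies on a threshold $\alpha > 0$ such that singular 
values smaller than $\alpha$ are treated as 0. Hence, the distance $d_{IC}$ has a regularization 
parameter $\alpha$, which must be chosen properly during training. 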



3.2 Kernel Mahalanobis Distance for Regularized Covariance 

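The empirical covariance operator may not be invertible as we work with finite
samples in a high-dimensional or infinite dimensional space $\mathcal{H}$. As an ansatz we
directly regularize the covariance operator to prevent it from being singular:
$C_{reg} := C + \sigma^2 I_{\mathcal{H}} = \frac{1}{n} \tilde{\Phi} \tilde{\Phi}^T + \sigma^2 I_{\mathcal{H}}$, where $\sigma^2 > 0$ is a parameter to be chosen. After
multiplying both sides by $\tilde{\Phi}$ from the right, using $\tilde{K} = \tilde{\Phi}^T \tilde{\Phi}$ and defining $\tilde{K}_{reg} := \tilde{K} + \alpha I_n$
for $\alpha := n \sigma^2$, we get $C_{reg} \tilde{\Phi} = \frac{1}{n} \tilde{\Phi} (\tilde{K} + n \sigma^2 I_n) = \frac{1}{n} \tilde{\Phi} \tilde{K}_{reg}$. Both $C_{reg}$
and $\tilde{K}_{reg}$ are strictly positive definite, hence non-singular, as $n \sigma^2 > 0$. The inverses
are therefore well-defined, leading to an equivalent of (2) as

$$ n \tilde{\Phi} \tilde{K}_{reg}^{-1} = C_{reg}^{-1} \tilde{\Phi}. \tag{5} $$

Note that $C_{reg}$ acts on an arbitrary centered vector $\tilde{\phi}(x)$ as
$C_{reg} \tilde{\phi}(x) = \frac{1}{n} \tilde{\Phi} \tilde{k}_x + \sigma^2 \tilde{\phi}(x)$, directly following from (3). Since $C_{reg}$ is invertible, we obtain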






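$$ \tilde{\phi}(x) = \frac{1}{n} C_{reg}^{-1} \tilde{\Phi} \tilde{k}_x + \sigma^2 C_{reg}^{-1} \tilde{\phi}(x). \tag{6} $$

After multiplying (6) by $\tilde{\phi}(x)^T$ from the left and thanks to (5), we
can write $\tilde{\phi}(x)^T \tilde{\phi}(x) = \tilde{\phi}(x)^T \tilde{\Phi} \tilde{K}_{reg}^{-1} \tilde{k}_x + \sigma^2 \tilde{\phi}(x)^T C_{reg}^{-1} \tilde{\phi}(x)$. We can solve for the
desired square Mahalanobis distance in the last term. By using the kernel quantities
$\tilde{k}_{xx} = \tilde{\phi}(x)^T \tilde{\phi}(x)$ and $\tilde{k}_x = \tilde{\Phi}^T \tilde{\phi}(x)$ we obtain the kernel Mahalanobis distance for
regularized covariance

$$ d_{RC}^2(x) := d^2(\phi(x); \{\phi_\mu, C_{reg}\}) = \sigma^{-2} \big( \tilde{k}_{xx} - \tilde{k}_x^T \tilde{K}_{reg}^{-1} \tilde{k}_x \big). \tag{7} $$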
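
The regularized distance (7) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `solve` and `d2_rc` are hypothetical, the linear system is solved by a small hand-written Gaussian elimination (an assumption made to stay within the standard library), and the class is passed as a list of tuples.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def d2_rc(X, x, k, sigma2):
    """Squared kernel Mahalanobis distance of x to the class X, as in Eq. (7)."""
    n = len(X)
    K = [[k(a, b) for b in X] for a in X]
    kx = [k(a, x) for a in X]
    row = [sum(r) / n for r in K]      # K is symmetric: row means = column means
    tot = sum(row) / n                 # (1/n^2) 1^T K 1
    # centered kernel matrix plus alpha * I, with alpha = n * sigma2
    K_reg = [[K[i][j] - row[i] - row[j] + tot + (n * sigma2 if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    v = [kx[i] - row[i] for i in range(n)]
    mean_v = sum(v) / n
    kx_c = [vi - mean_v for vi in v]                 # centered kernel vector
    kxx_c = k(x, x) - 2.0 * sum(kx) / n + tot        # centered self-similarity
    z = solve(K_reg, kx_c)
    return (kxx_c - sum(a * b for a, b in zip(kx_c, z))) / sigma2
```

For a linear kernel on one-dimensional data the result should coincide with $(x - \mu)^2 / (s^2 + \sigma^2)$, where $s^2$ is the variance normalized by $n$, which offers a quick sanity check of (7).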



3.3 Classifiers Based on Kernel Mahalanobis Distances 

Kernel Quadratic Discriminant (KQD). First, we consider the straightforward extension
of Quadratic Discriminant (QD) analysis in Euclidean spaces. This leads to
Kernel Quadratic Discriminants (KQD) (Pękalska & Haasdonk, 2009). For a $c$-class
problem in a space $\mathcal{X} = \mathbb{R}^d$ with regular class-wise covariance matrices
$\Sigma^{[j]}$, means $\mu^{[j]}$ and prior probabilities $P(\omega_j)$, the quadratic discriminant for the
$j$-th class is given as $f^{[j]}(x) := -\frac{1}{2} (x - \mu^{[j]})^T (\Sigma^{[j]})^{-1} (x - \mu^{[j]}) + b_j$, where
$b_j := -\frac{1}{2} \ln(\det(\Sigma^{[j]})) + \ln(P(\omega_j))$. A new sample $x$ is classified to $\omega_i$ with
$i = \arg\max_{j=1,\ldots,c} f^{[j]}(x)$; see for instance Duda, Hart, and Stork (2001).

By inserting the class-wise kernel Mahalanobis distances, two different decision
functions are obtained for KQD, $f_{IC}^{[j]}(x) := -\frac{1}{2} (d_{IC}^{[j]}(x))^2 + b_j$ and
$f_{RC}^{[j]}(x) := -\frac{1}{2} (d_{RC}^{[j]}(x))^2 + b_j$ for the invertible and regularized covariance case, respectively.
The offset $b_j$ can be expressed by kernel evaluations thanks to
$\ln(\det(C^{[j]})) = \ln \prod_i \lambda_i^{[j]}$, where the eigenvalues $\lambda_i^{[j]}$ of $C^{[j]}$ are identical to the eigenvalues of
$\frac{1}{n} \tilde{K}^{[j]}$ for $i = 1, \ldots, l := \mathrm{rank}(\tilde{K}^{[j]})$. Numerical problems however arise in
computing the logarithm of the eigenvalue product if many small eigenvalues
occur. This happens in practice because a kernel matrix often has a slowly decaying
eigenvalue spectrum. Consequently, we choose the offset values by a training
error minimization procedure; see Pękalska and Haasdonk (2009) for details. In the
following we refer to the resulting classifiers as KQD-IC and KQD-RC.

Fisher and Quadratic Discriminants in Dissimilarity Spaces. We can define new
features of a low-dimensional space by the square kernel Mahalanobis distances
computed to the class means. Hence, given a $c$-class problem and class-wise
squared dissimilarities $(d^{[j]}(x))^2$, $j = 1, \ldots, c$, we can define a data-dependent
mapping to a $c$-dimensional dissimilarity space $\psi : \mathcal{X} \to \mathbb{R}^c$ with
$\psi(x) := [(d^{[1]}(x))^2, \ldots, (d^{[c]}(x))^2]^T$. This can be done for either the $d_{IC}$ or the $d_{RC}$ distances.
For $c = 2$ classes, the KQD decision boundary is simply a line parallel to the main
diagonal in this 2D dissimilarity space. For certain data distributions, more complex
decision boundaries may be required. Since kernel Mahalanobis distances are
derived based on the within-class information only, subsequent decision functions
in this dissimilarity space enable us to use the between-class information more
efficiently. Two classifiers are considered here, namely Fisher Discriminants (FD)
and Quadratic Discriminants (QD); see, e.g., Duda et al. (2001). Since we apply
these in two dissimilarity spaces defined by either $d_{IC}$ or $d_{RC}$, we get four additional
classification strategies denoted as FD-IC, FD-RC, QD-IC and QD-RC,
correspondingly.



4 Experiments 

In order to get insights into the kernel Mahalanobis distances, we first perform 2D
experiments on an artificial data set for different sample sizes and kernels. Then
we turn to some real-world problems. We include three reference classifiers to
compare the overall classification performance. These are two linear kernel classifiers,
Support Vector Machine (SVM) (Schölkopf & Smola, 2002) and Kernel Fisher
Discriminant (KFD) (Mika et al., 1999), and a nonlinear Kernel k-Nearest Neighbor
(KNN) classifier. The KNN classifier is based on the kernel-induced distance in
the feature space $\|\phi(x) - \phi(x')\|^2 = k(x, x) - 2 k(x, x') + k(x', x')$, which corresponds
to the usual k-nearest neighbor decision in the input space for a Gaussian
kernel. The regularization parameters are the usual $C$ for penalization in SVM,
$\beta$ for regularizing the within-class scatter in KFD and the number of neighbors $k$ for
KNN. All experiments rely on PRtools4 (http://prtools.org).
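
The kernel-induced distance behind the KNN reference classifier can be sketched as follows. This is a hedged illustration only: `feature_dist2` and `knn_predict` are hypothetical names, the experiments in the paper rely on PRTools rather than on this code, and ties in the majority vote are broken arbitrarily here.

```python
import math

def k_rbf(x, y, gamma=1.0):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def feature_dist2(x, y, k):
    """Squared feature-space distance ||phi(x) - phi(y)||^2 from kernel values only."""
    return k(x, x) - 2.0 * k(x, y) + k(y, y)

def knn_predict(X, labels, x, k, n_neighbors=3):
    """Majority vote among the n_neighbors training points closest to x in feature space."""
    order = sorted(range(len(X)), key=lambda i: feature_dist2(X[i], x, k))
    votes = [labels[i] for i in order[:n_neighbors]]
    return max(set(votes), key=votes.count)
```

For the Gaussian kernel the distance equals $2 - 2\exp(-\gamma \|x - x'\|^2)$, a monotone function of the input-space distance, which is why the neighbor ranking agrees with ordinary k-nearest neighbors in the input space.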



4.1 Experiments on 2D Toy Data 

We consider a two-class toy problem as illustrated in Fig. 1. Both classes have equal 
class-priors and are generated by a mixture of two normal distributions such that 
the resulting distributions are no longer unimodal. Hence, QD analysis is invalid 
here and stronger nonlinear models must be applied. The training set consists of 
200 samples. We study both Gaussian and polynomial kernels, $k_{rbf}$ and $k_{pol}$. The
optimal kernel parameters and regularization parameters of the classifiers are chosen
by 10-fold cross-validation. The cross-validation ranges for the kernel parameters are
$\gamma \in [0.01, 50]$, discretized by eight values, and $p = 1, 2, 3, 4$. The regularization
parameters and cross-validation ranges (each discretized by eight values) are a e 
[10~®, 10“’] (class-wise identical) for KQD-IC, FD-IC and QD-IC, a = na^ e 
[10-5, j] (class-wise identical) for KQD-RC, FD-RC and QD-RC, C e [10“’, 10®] 
for SVM, k e [1,8] for KNN and e [10“®, 10] for KFD. The resulting kernel 
Mahalanobis classifiers with kernel $k_{rbf}$ and the training data set are depicted in the
left plot of Fig. 1. The right plot shows the reference classifiers. The KNN rule is, as
expected, highly nonlinear. Overall, all classifiers perform reasonably well.

The classification errors are determined on an independently drawn test set of
1,000 examples. The procedure of data drawing, cross-validated training of the
classifiers and test-error determination is repeated for ten random training and
test-set drawings. The mean errors and standard deviations are shown in Table 1. To
assess the dependence on the sample number, we also determine results for smaller
training sample sizes $n$.

Fig. 1 Cross-validated classifiers on 2D toy data with kernel $k_{rbf}$ (left: kernel
Mahalanobis classifiers; right: reference classifiers; both panels show classification
boundaries in $\mathbb{R}^2$)

Table 1 Average classification errors (in percent) for 2D data with different training sample sizes
$n$ and kernels. Numbers in parentheses denote standard deviations

          k_rbf                                  k_pol
          n = 50      n = 100     n = 200       n = 50      n = 100     n = 200
KQD-IC    20.8 (4.2)  17.4 (1.1)  15.5 (1.4)    18.8 (1.8)  17.2 (1.6)  16.0 (1.5)
FD-IC     20.9 (3.8)  17.9 (2.3)  15.7 (2.0)    20.8 (5.0)  19.3 (2.6)  16.0 (1.2)
QD-IC     21.7 (4.8)  16.7 (0.9)  16.0 (1.7)    19.7 (3.5)  18.4 (2.3)  17.3 (1.8)
KQD-RC    18.8 (2.1)  16.2 (1.0)  15.3 (1.6)    16.2 (1.8)  16.5 (1.8)  14.9 (1.2)
FD-RC     18.4 (2.1)  17.5 (1.8)  15.3 (1.7)    17.5 (2.9)  17.9 (2.7)  15.5 (1.0)
QD-RC     18.5 (2.2)  15.8 (1.2)  14.9 (1.8)    19.5 (4.2)  18.5 (3.1)  17.2 (1.9)
KFD       19.5 (3.1)  16.5 (2.2)  14.7 (1.4)    16.7 (2.3)  16.4 (2.4)  14.5 (1.2)
SVM       19.0 (2.0)  17.0 (1.8)  16.1 (2.7)    17.4 (2.4)  19.6 (6.3)  17.9 (2.2)
KNN       18.6 (3.0)  16.3 (1.6)  15.4 (1.6)    17.7 (2.8)  17.0 (2.5)  16.7 (1.4)

Among the reference classifiers we see that nonseparability is problematic for
SVM as it performs worse than the KNN approach for pronounced cases (larger $n$).
KFD is frequently similar to or better than SVM, as also reported in other studies
(Mika et al., 1999). Among the different Mahalanobis distances we observe
a superiority of the approaches based on $d_{RC}$ over those using $d_{IC}$. The difference
in performance increases as the sample size $n$ decreases. KQD-IC
seems favorable among the IC-approaches. Concerning the RC-approaches, QD-RC
seems favorable for $k_{rbf}$ while KQD-RC seems favorable for $k_{pol}$. Good results are
obtained by both $k_{rbf}$ and $k_{pol}$ for this data set. Comparing the kernel Mahalanobis
approaches to the reference methods, the former provide similar results to those of
the reference classifiers.

Table 2 Data used in our experiments and hold-out ratio

Data         #Obj.   #Feat.  #Class  Class sizes   r_tr   Variables
Biomed       194     5       2       127/67        0.50   Mixed
Diabetes     768     8       2       500/268       0.50   Mixed
Ecoli        272     6       3       143/77/52     0.50   Continuous
Glass        214     9       4       70/76/17/51   0.50   Continuous
Heart        297     13      2       160/137       0.50   Mixed
Imox         192     8       4       48            0.50   Integer-valued
Ionosphere   351     34      2       225/126       0.50   Continuous
Liver        345     6       2       145/200       0.50   Cont. integer-valued
Mfeat-Fac    2,000   216     10      200           0.15   Continuous
Mfeat-Fou    2,000   76      10      200           0.15   Continuous
Sonar        208     60      2       97/111        0.50   Continuous
Wine         178     13      3       59/71/48      0.50   Continuous


4.2 Real-World-Experiments 

We use data from the UCI Repository (http://archive.ics.uci.edu/ml/). They describe
problems with categorical, continuous and mixed features and with varying numbers
of dimensions and classes. Each data set is split into training and test sets in the ratio
$r_{tr}$ as specified in Table 2. We standardize the vectorial data and apply a Gaussian
kernel $k_{rbf}$. For multiclass problems, SVM and KFD are trained in the one-vs.-all
scenario. As before, the optimal kernel parameter $\gamma$ and regularization parameters
of all classifiers are determined by 10-fold cross-validation with some slightly
adjusted search ranges, i.e., a e [10~®, 5 x 10“’] for KQD-IC, FLD-IC and QD- 
IC, a = na^ & [10“^ 2] for KQD-RC, FLD-RC and QD-RC, C e [10“’, 10®] for 
SVM, k G [1,15] for KNN and P G [10“®, 2] for KFD. The average test-errors and 
the standard deviations over ten repetitions are reported in Table 3. 

Concerning the reference methods, we observe that KFD is mostly best, sometimes
outperformed by SVM. Among the kernel Mahalanobis classifiers we again
note that the RC-versions are almost uniformly better than the IC-versions. In a
number of cases the IC-versions are clearly inferior (Ecoli, Glass, Heart, Mfeat-*,
Sonar, Wine, Ionosphere). This occurs when the number of samples is low as compared
to the original dimensionality. Interestingly, QD-RC often gives similar or
better results than KQD-RC, which does not hold analogously for the IC-versions. The kernel
Mahalanobis classifiers are mostly comparable to the reference classifiers for both
binary and multiclass problems. QD-RC performs best overall (also better than the
reference classifiers) for the Diabetes, Imox, Ionosphere, and Wine data. Both the KQD-RC
and QD-RC classifiers are better than the reference classifiers for the Imox and
Sonar data.



5 Discussion and Theoretical Considerations 

We focus now on some theoretical aspects concerning the kernel Mahalanobis 
distances with respect to their usage. 

Table 3 Average classification errors (in percent) for real data and kernel $k_{rbf}$. Numbers in
parentheses denote the standard deviations

          Biomed      Diabetes    Ecoli       Glass       Heart       Imox
KQD-IC    16.2 (3.8)  28.3 (1.8)   7.6 (3.6)  46.7 (8.2)  20.5 (2.0)   7.2 (2.4)
FD-IC     22.6 (5.0)  32.6 (2.4)  12.0 (3.0)  49.8 (3.5)  21.1 (2.0)  14.1 (4.6)
QD-IC     16.5 (4.1)  29.6 (2.2)   7.2 (2.8)  52.0 (4.2)  21.9 (1.9)   8.4 (2.0)
KQD-RC    16.6 (3.1)  28.2 (2.1)   5.9 (1.6)  44.0 (6.3)  16.7 (1.9)   9.2 (3.4)
FD-RC     16.4 (4.4)  28.2 (1.2)   5.8 (1.8)  44.4 (4.3)  17.1 (2.4)  10.9 (3.8)
QD-RC     15.5 (2.8)  25.8 (2.3)   5.6 (1.9)  40.7 (4.8)  18.3 (2.7)   6.6 (2.5)
KFD       16.5 (2.8)  26.3 (2.1)   5.2 (1.6)  36.7 (5.7)  18.4 (2.3)   9.4 (2.2)
SVM       15.2 (2.3)  28.9 (2.3)   5.2 (2.3)  39.3 (5.0)  16.4 (2.3)  10.1 (3.3)
KNN       20.6 (3.7)  30.8 (0.9)   7.4 (2.1)  43.9 (5.3)  17.3 (2.6)   9.6 (5.3)

          Ionosphere   Liver       Mfeat-Fac   Mfeat-Fou   Sonar       Wine
KQD-IC    11.2 (2.6)   35.6 (4.0)  10.0 (1.7)  61.4 (3.2)  29.5 (5.7)  5.1 (2.6)
FD-IC     12.2 (2.5)   41.8 (3.8)  13.5 (1.4)  55.7 (3.4)  31.7 (4.8)  6.5 (2.8)
QD-IC     11.7 (2.2)   42.1 (3.7)  12.4 (1.8)  35.5 (2.7)  35.5 (3.1)  7.4 (3.3)
KQD-RC     7.8 (3.3)   39.6 (4.6)   6.1 (0.6)  25.1 (1.4)  15.7 (3.2)  3.8 (1.4)
FD-RC      7.5 (2.0)   37.6 (2.9)   7.1 (0.8)  26.6 (1.4)  22.0 (4.0)  3.5 (1.4)
QD-RC      5.8 (1.7)   39.6 (3.4)   6.1 (1.0)  25.7 (1.1)  16.6 (2.5)  2.8 (1.7)
KFD        6.8 (2.2)   32.9 (2.6)   3.9 (0.6)  22.9 (0.9)  17.7 (3.3)  3.8 (1.9)
SVM        7.1 (1.4)   30.4 (3.1)   4.7 (0.6)  23.0 (1.0)  18.2 (5.3)  3.1 (1.8)
KNN       23.9 (14.7)  41.2 (3.7)   8.1 (6.2)  28.3 (1.6)  19.8 (3.9)  8.3 (7.1)


Assumption on Invertible Covariance. The motivation behind the distance $d_{IC}$
requires that the covariance operator is invertible. As a theoretical consequence,
the sound derivation is limited to a finite dimensional $\mathcal{H}$. This is violated, e.g., for
the Gaussian kernel $k_{rbf}$. Counterintuitive situations may occur if non-singularity
does not hold: a vector $\tilde{k}_x$ in (4) may be nonzero but lie in the eigenspace of $\tilde{K}$
corresponding to the eigenvalue 0. This may occur if $x$ is atypical with respect
to the training samples. Simple computation yields $d_{IC}^2(x) = 0$. If a classifier
uses the distance as an indication of the likelihood of $x$ belonging to the corresponding
class, the classification result will be clearly counterintuitive and possibly
wrong. This phenomenon can be demonstrated on simple 2-class XOR data
($X = \{(-1,-1)^T, (-1,1)^T, (1,-1)^T, (1,1)^T\}$, $Y = \{\omega_1, \omega_2, \omega_2, \omega_1\}$), as illustrated in
Fig. 2, where the first class is plotted as circles and the second class as crosses. We
plot a shading of the square kernel Mahalanobis distances of the first class resulting
from the kernel k^bt with y = 1. We clearly see the fundamental qualitative dif- 
ference between (left, a = 10“'*) and (right, = 1). The left plot 
demonstrates the problematic case, where the training examples of the first class
have a higher distance to their own class than the samples of the second class. This
illustrates why the discrimination power of the $d_{IC}$ distances can decrease for few
training samples in high-dimensional $\mathcal{H}$, as observed in our experiments. Nevertheless,
the IC-methods are still applicable for infinite dimensional $\mathcal{H}$. Formally, the
final decision rules are still well defined and can be applied regardless of whether
the covariance operator is singular or not. Empirically, the results are frequently
quite good. We may conclude that the pathological cases are rarely observed in
practice if sufficiently many samples are available for training. Still, a decrease
in classification accuracy may be observed for few samples in high-dimensional
spaces. In these cases, the use of $d_{RC}$ is clearly more satisfactory and beneficial
from a theoretical point of view.

Fig. 2 XOR-example and square kernel Mahalanobis distances for the kernel $k_{rbf}$. The left plot
shows $d_{IC}^2$ while the right plot shows $d_{RC}^2$

Invariance. An interesting theoretical issue is invariance of the Mahalanobis distances
in the kernel feature space. These invariance properties naturally transfer
to kernel transformations that do not affect the resulting distances. One can easily
check by definitions that the Mahalanobis distance is translation invariant in
the feature space, i.e., under $\hat{\phi}(x) := \phi(x) + \phi_0$ for a translation vector $\phi_0 \in \mathcal{H}$.
Choosing $\phi_0 := \phi(x_0)$ for any $x_0 \in \mathcal{X}$ (or a general arbitrary linear combination)
implies that both $d_{IC}$ and $d_{RC}$ remain identical when using the shifted kernel
$\hat{k}(x, x') := \langle \hat{\phi}(x), \hat{\phi}(x') \rangle = k(x, x') + k(x, x_0) + k(x', x_0) + k(x_0, x_0)$. In particular,
kernel centering does not affect the distances. In analogy to Euclidean Mahalanobis
distances, kernel Mahalanobis distances are invariant to scaling of the feature space
by using the scaled kernels $\hat{k}(x, x') := \theta k(x, x')$ for $\theta > 0$. As we involve
regularization parameters, this invariance only holds in practice if the regularization
parameters are similarly scaled, $\hat{\alpha} := \theta \alpha$ and $\hat{\sigma}^2 := \theta \sigma^2$. Consequently, a kernel
can be used without a scale-parameter search.



6 Conclusion 

We presented two versions of kernel Mahalanobis distance, $d_{IC}$ and $d_{RC}$, derived
either for invertible covariance operators or based on an additive regularization
thereof. The distance $d_{RC}$ leads to empirically better classification performance
than $d_{IC}$, in particular for small sample size problems. Overall, the former measure
is both conceptually and empirically favorable. These two Mahalanobis distances
represent one-class models as only the within-class kernel information is used for
their construction. The between-class information is utilized in subsequent classifiers.
Fully kernelized quadratic discriminant analysis can be performed by the
KQD-IC/KQD-RC methods. Additional classifiers can be applied in the dissimilarity
space obtained from the kernel Mahalanobis distances as illustrated with Fisher
Discriminants FD-IC/FD-RC and Quadratic Discriminants QD-IC/QD-RC. Empirically,
they often give comparable results to the reference classifiers. In several cases,
QD-RC gives the overall best results. The kernel Mahalanobis classifiers can be
advantageous for problems with high class overlap or nonlinear pattern distributions
in a kernel-induced feature space.

Identifying Influential Cases in Kernel Fisher
Discriminant Analysis by Using the Smallest
Enclosing Hypersphere



Nelmarie Louw, Sarel Steel and Morné Lamont



Abstract Kernel methods have become standard tools for solving classification and 
regression problems in statistics. An example of a kernel based classification method 
is Kernel Fisher discriminant analysis (KFDA). Conceptually KFDA entails trans- 
forming the data in the input space to a high-dimensional feature space, followed 
by linear discriminant analysis (LDA) performed in feature space. Although the 
resulting classifier is linear in feature space, it corresponds to a non-linear classifier 
in input space. However, as in the case of LDA, the classification performance of 
KFDA deteriorates in the presence of influential data points. Louw et al. (Commu- 
nications in Statistics: Simulation and Computation 37:2050-2062, 2008) proposed 
several criteria for identification of influential cases in KFDA. In extensive simu- 
lation studies these criteria have been found to be successful, in the sense that the 
error rate of the KFD classifier based on the data set after removal of influential 
cases, is lower than the error rate of the KFD classifier based on the entire data set. 
A disadvantage is that these criteria are calculated on a leave-one-out basis, which 
becomes computationally expensive when dealing with large data sets. In this paper 
we propose a two-step procedure for identifying influential cases in large data sets. 
Firstly, a subset of potentially influential data cases is found by constructing the 
smallest enclosing hypersphere (for each group) in feature space. Secondly, the pro- 
posed criteria are employed to identify influential cases, but only cases in the subset 
are considered on a leave-one-out basis, leading to a substantial reduction in com- 
putation time. We investigate the merit of this new proposal in a simulation study, 
and compare the results to the results obtained when not using the hypersphere as a 
first step. We conclude that the new proposal has merit. 

Keywords Classification • Discriminant analysis • Kernel methods. 

1 Introduction

Consider a binary classification problem in which it is desired to use data with 
known group membership (called the training data) to obtain a classification rule 
(or classifier) that can be used to classify new data cases with unknown group mem- 
bership. A well known and widely used procedure is Fisher’s linear discriminant 
analysis (LDA), which performs very well when the data come from homoscedastic 
populations that are (approximately) normally distributed. Linear classifiers do not 
always perform well, and several non-linear classification procedures have there- 
fore been developed. Mika, Ratsch, Weston, Scholkopf, and Muller (1999) proposed 
Kernel Fisher discriminant analysis (KFDA), a kernel based extension of Fisher’s 
linear discriminant rule to a non-linear classifier and showed that the KFDA clas- 
sifier yields error rates comparable to that of the support vector machine (SVM), a 
well known kernel based classifier. 

Several studies investigating influential data points in LDA have appeared in the 
literature (cf. Critchley & Vitiello, 1991; Fung, 1992, 1995). Aspects
considered in these papers include the accuracy of parameter estimates, estimated 
posterior probabilities of group membership, and the classification performance 
(quantified in terms of the misclassification/error rate) of the resulting discriminant 
rule. In our work, we define an influential data case as one whose omission from 
the training data prior to constructing the classifier leads to a reduction in error rate. 
Often the point being investigated may be described as atypical, i.e., a point which 
is in a certain sense different from the rest of the cases (in a specific group). In this 
paper we use the term atypical case to specifically refer to a mislabelled point, i.e., 
a point actually coming from Population 1 but carrying a Population 2 label, or vice 
versa. Three criteria for identifying influential data cases are discussed in Sect. 3. 
Although these criteria are not based on error rate estimates, empirical evidence 
presented in Louw, Lamont, and Steel (2008) confirms that they largely succeed in
identifying influential data cases. 

A disadvantage of these criteria is that they are calculated on a leave-one-out 
basis, making the implementation computationally expensive. In this paper we 
therefore propose using the smallest enclosing hypersphere as a pre-processing step 
to reduce the number of cases to be evaluated by the criteria on a leave-one-out 
basis. This leads to a substantial reduction in computation time. 

The paper is organised as follows. Section 2 introduces required notation, and 
provides technical details on KFDA. In Sect. 3 we review the criteria for identifying
influential cases. Section 4 contains a description of the smallest enclosing hyper- 
sphere in feature space as well as our proposal for using this as a pre-processor. 
The merit of the proposal is investigated in a Monte Carlo simulation study which 
is discussed in Sect. 5. Application of the procedure to a data set appears in Sect. 6. 
Section 7 contains conclusions and open problems. 

2 Kernel Fisher Discriminant Analysis

Consider the following two-group classification problem. We observe a binary
response variable $Y \in \{-1, +1\}$, together with a number of classification or input
variables $X_1, X_2, \ldots, X_p$. These variables are observed for $n = n_1 + n_2$ sample
cases, with the first $n_1$ cases from Population 1 and the remaining $n_2$ cases
from Population 2. The mean vectors of the two populations will be denoted by
$\mu_1$ and $\mu_2$ respectively, and the covariance matrices by $\Sigma_1$ and $\Sigma_2$. We write
$J_1 = \{1, 2, \ldots, n_1\}$ for the set of indices corresponding to the cases from Population 1,
and $J_2 = \{n_1 + 1, n_1 + 2, \ldots, n\}$ for those from Population 2. The resulting
training data set is denoted by $T = \{(x_i, y_i), i = 1, 2, \ldots, n\}$, where $x_i$ is a
$p$-vector containing the observed values of $X_1, X_2, \ldots, X_p$ for the $i$-th sample case,
$i = 1, 2, \ldots, n$. The objective is to use $T$ to determine a rule that can be used to
assign a new case with observed values of the predictor variables in a vector $x$ to
one of the two populations. Conceptually KFDA entails transforming the original
data in the input space to a high-dimensional feature space, followed by application
of the usual LDA procedure in feature space. We denote the transformation function
by $\Phi(\cdot)$, and the resulting data cases in feature space by $\Phi(x_i)$, $i = 1, 2, \ldots, n$.
The dimensionality of the feature space is typically very high and can even be infinite,
so performing calculations in this space is difficult or impossible. However, if
an algorithm can be expressed in a form which contains the mapped data only as
inner products, the kernel trick can be used to circumvent the problem. The kernel
trick, which is based on the theory of reproducing kernel Hilbert spaces and
specifically Mercer's theorem, entails replacing inner products by a kernel function,
$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. This obviates the need to specify a mapping
$\Phi$ or to perform calculations in feature space. Although the classifier resulting
from this will be linear in feature space, it corresponds to a non-linear classifier
in input space. The KFDA classifier is given by $\mathrm{sign}\{b + \sum_{i=1}^{n} \alpha_i K(x_i, x)\}$. Here
$b$ and $\alpha_1, \alpha_2, \ldots, \alpha_n$ are quantities determined by applying the KFDA algorithm to
the training data. Throughout this paper we will use the popular Gaussian kernel,
defined by $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, where $\gamma$ is a so-called kernel hyperparameter
that has to be specified beforehand or determined from the data. Empirical
evidence suggests that $\gamma = 1/p$ generally works well in KFDA and we will use this
throughout the paper. Evaluating $K(x_i, x_j)$ for $i, j = 1, 2, \ldots, n$, we are able to
construct the so-called Gram matrix, $K$, with $ij$-th entry $K(x_i, x_j)$. The constants $\alpha_i$
are determined as follows. Let $\alpha$ be an $n$-vector with elements $\alpha_1, \alpha_2, \ldots, \alpha_n$. Then
$\alpha$ maximises the Rayleigh coefficient

$$ r(\alpha) = \frac{\alpha' M \alpha}{\alpha' N \alpha}. \tag{1} $$

In (1), $M = (m_1 - m_2)(m_1 - m_2)'$, and $N = K K' - n_1 m_1 m_1' - n_2 m_2 m_2'$, where the
$n$ elements of $m_1$ are given by $\frac{1}{n_1} \sum_{j=1}^{n_1} K(x_i, x_j)$, $i = 1, 2, \ldots, n$, with a similar expression for $m_2$.
The analogy with classical linear discriminant analysis is clear: we may interpret $M$
as the between group scatter matrix, and $N$ as the within group scatter matrix, in
both cases taking into account that we are effectively working in the feature space
induced by the kernel function. For a more detailed discussion of KFDA, see Mika
et al. (1999).

It is well known that $N^{-1}(m_1 - m_2)$ will maximise (1). There is however one
problem: the matrix $N$ is singular and consequently we cannot find $\alpha$ by simply
calculating $N^{-1}(m_1 - m_2)$. Mika et al. (1999) propose and motivate the use of
regularization to overcome this difficulty. In the present context regularization entails
replacing $N$ by a matrix $N_\lambda = N + \lambda I$, for some (small) positive scalar $\lambda$. This
yields a solution $N_\lambda^{-1}(m_1 - m_2)$, depending on $\lambda$, which can be used in (1). In
addition to overcoming the singularity problem, this affords one the opportunity of
implementing regularisation, which typically leads to reduced generalisation error
(cf. Mika et al., 1999). Obviously the hyperparameter $\lambda$ has to be specified, and
this is typically done by performing a cross-validation search along a suitable grid of
potential $\lambda$-values.

The intercept $b$ can be specified in different ways. A popular choice, which we
will also use, is $b = 0.5 (m_2' N_\lambda^{-1} m_2 - m_1' N_\lambda^{-1} m_1) + \log(n_1 / n_2)$, which is similar
to the intercept used in linear discriminant analysis.
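
The fitting steps described in this section can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the names `solve`, `gauss`, `kfda_fit` and `kfda_score` are hypothetical, the linear systems are solved by a small hand-written Gaussian elimination, the first `n1` rows of `X` are taken to be Population 1, and a positive score is read as "Population 1" (which follows from using $N_\lambda^{-1}(m_1 - m_2)$ together with the intercept above, but is stated here as an assumption about the label coding).

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def gauss(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kfda_fit(X, n1, gamma, lam):
    """Return (alpha, b) of regularized KFDA; rows 0..n1-1 of X form Population 1."""
    n = len(X)
    n2 = n - n1
    K = [[gauss(xi, xj, gamma) for xj in X] for xi in X]
    m1 = [sum(K[i][:n1]) / n1 for i in range(n)]
    m2 = [sum(K[i][n1:]) / n2 for i in range(n)]
    # N_lambda = K K' - n1 m1 m1' - n2 m2 m2' + lambda I
    N = [[sum(K[i][t] * K[j][t] for t in range(n))
          - n1 * m1[i] * m1[j] - n2 * m2[i] * m2[j]
          + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    z1 = solve(N, m1)                      # N_lambda^{-1} m1
    z2 = solve(N, m2)                      # N_lambda^{-1} m2
    alpha = [a - c for a, c in zip(z1, z2)]
    b = 0.5 * (sum(p * q for p, q in zip(m2, z2))
               - sum(p * q for p, q in zip(m1, z1))) + math.log(n1 / n2)
    return alpha, b

def kfda_score(X, alpha, b, x, gamma):
    """Discriminant value b + sum_i alpha_i K(x_i, x); its sign gives the class."""
    return b + sum(a * gauss(xi, x, gamma) for a, xi in zip(alpha, X))
```

A call such as `kfda_fit(X, n1, 1.0 / p, lam)` mirrors the choice $\gamma = 1/p$ mentioned above; the value of `lam` would in practice come from the cross-validation search over $\lambda$.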



3 Criteria for Identifying Influential Cases in KFDA 



In this section we review some of the criteria which were proposed by Louw et al.
(2008) for identification of influential cases. These criteria are calculated on a
leave-one-out basis. For $i = 1, 2, \ldots, n$, the $i$-th case is omitted from the training data,
and the remaining cases are used to calculate the Gram matrix, denoted by $K^{(i)}$. The
KFDA classifier is also calculated and the resulting $\alpha$-vector is denoted by $\alpha^{(i)}$, and
the intercept by $b^{(i)}$. This is done for $i = 1, 2, \ldots, n$, and the optimal (minimum or
maximum depending on the specific criterion) value of the criterion is found. The
case corresponding to this optimal value is the case identified by that criterion as
influential.
The first criterion is based on the concept of maximising the ratio of the between 
group and within group variation, calculated in feature space. The squared distance 
between the two group mean vectors in feature space is given by 

$$\frac{1}{n_1^2} \sum_{j \in J_1} \sum_{k \in J_1} K(x_j, x_k) + \frac{1}{n_2^2} \sum_{j \in J_2} \sum_{k \in J_2} K(x_j, x_k) - \frac{2}{n_1 n_2} \sum_{j \in J_1} \sum_{k \in J_2} K(x_j, x_k),$$

while the average squared deviation of the observations in group j from their mean 
is given by 

$$\frac{1}{n_j} \sum_{k \in J_j} K(x_k, x_k) - \frac{1}{n_j^2} \sum_{i \in J_j} \sum_{k \in J_j} K(x_i, x_k), \quad j = 1, 2.$$

The ratio v of the between group variation to the within group variation in feature 
space can therefore be calculated. We use this ratio to identify the most influential 
case by calculating $v^{(i)}$ for $i = 1, 2, \ldots, n$. Since a large value 
of the ratio of the between group and within group variation is desirable, the case 
whose omission results in the maximum value of $v^{(i)}$ is declared to be influential. 
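
Both kernel expressions need only the Gram matrix, which the following Python sketch illustrates. The function names are hypothetical, the Gaussian kernel is assumed to have the form exp(-gamma * ||x - z||^2), and the denominator of the ratio is assumed to be the sum of the two within group deviations; none of these details is confirmed by the text above.

```python
import math

def gaussian_kernel(x, z, gamma):
    """Gaussian kernel exp(-gamma * ||x - z||^2) (assumed parameterisation)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def gram_matrix(points, gamma):
    """Full Gram matrix K[i][j] = K(x_i, x_j)."""
    return [[gaussian_kernel(x, z, gamma) for z in points] for x in points]

def mean_block(K, rows, cols):
    """Average of K over the index sets rows x cols."""
    return sum(K[i][j] for i in rows for j in cols) / (len(rows) * len(cols))

def between_distance(K, J1, J2):
    """Squared distance between the two group means in feature space."""
    return mean_block(K, J1, J1) + mean_block(K, J2, J2) - 2.0 * mean_block(K, J1, J2)

def within_deviation(K, J):
    """Average squared deviation of group J from its feature-space mean."""
    return sum(K[k][k] for k in J) / len(J) - mean_block(K, J, J)

def variation_ratio(K, J1, J2):
    """Between/within ratio; summing the two within terms is an assumption."""
    return between_distance(K, J1, J2) / (within_deviation(K, J1) + within_deviation(K, J2))

def leave_one_out_ratios(K, J1, J2):
    """Ratio after omitting each case in turn (omission only shrinks an index set)."""
    ratios = {}
    for i in J1 + J2:
        ratios[i] = variation_ratio(K, [j for j in J1 if j != i], [j for j in J2 if j != i])
    return ratios
```

Because omitting a case simply removes one row and column of the Gram matrix, the leave-one-out values can be read off the full matrix without recomputing any kernel evaluations.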

In order to define the second criterion, let $f(x_i) = \sum_{j=1}^{n} \alpha_j K(x_i, x_j) + b$, 
$i = 1, 2, \ldots, n$. Then the margin of case i is $y_i f(x_i)$, and this is positive if and 
only if case i is correctly classified. It is well known that a large average margin 
is often associated with good classification performance. Taking our cue from this, 
we propose $m = \frac{1}{n} \sum_{i=1}^{n} y_i f(x_i)$ as a criterion for identifying an 
influential case. We calculate the value of $m^{(i)}$ and identify the case whose omission 
results in the maximum value of $m^{(i)}$ as influential. 
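
The average margin is simple to compute once a classifier is available. The Python sketch below assumes labels coded as -1 and +1 and takes the coefficient vector and intercept as given inputs; the fitting step itself is not shown, and the function names are hypothetical.

```python
def decision_values(K, alpha, b):
    """f(x_i) = sum_j alpha_j K(x_i, x_j) + b for every training case i."""
    return [sum(a * k for a, k in zip(alpha, row)) + b for row in K]

def average_margin(K, y, alpha, b):
    """m = (1/n) sum_i y_i f(x_i); labels y_i are assumed to be -1 or +1."""
    f = decision_values(K, alpha, b)
    return sum(yi * fi for yi, fi in zip(y, f)) / len(y)

def misclassified(K, y, alpha, b):
    """Indices with non-positive margin, i.e. cases not correctly classified."""
    f = decision_values(K, alpha, b)
    return [i for i, (yi, fi) in enumerate(zip(y, f)) if yi * fi <= 0]
```

For a leave-one-out value, the classifier would be refitted without case i and the same function applied to the quantities from that refit.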

For the last criterion we calculate the Rayleigh coefficient (1) after omitting 
case i, i.e. 

$$r^{(i)} = \frac{\alpha^{(i)\prime} M^{(i)} \alpha^{(i)}}{\alpha^{(i)\prime} N^{(i)} \alpha^{(i)}}.$$

The observation whose omission maximises this criterion is pronounced to be 
influential. 



4 The Smallest Enclosing Hypersphere 

The theory of the smallest enclosing hypersphere in feature space is explained in 
Tax and Duin (1999) and Shawe-Taylor and Cristianini (2004). It is constructed 
as follows. Consider a data set in feature space. An enclosing hypersphere for 
this data set can be specified in terms of its centre c and the distance, r, from c 
to the furthest point in the data set. The smallest enclosing hypersphere has centre 
$c^* = \arg\min_c \{\max_i \lVert \Phi(x_i) - c \rVert\}$ and radius 
$r^* = \max_i \lVert \Phi(x_i) - c^* \rVert$. 

Tax and Duin (1999) argue that $c^*$ and $r^*$ can be found by solving the following 
quadratic optimisation problem: $\min_{c,r} \{r^2\}$, subject to 
$\lVert \Phi(x_i) - c \rVert^2 \le r^2$, $\forall i = 1, 2, \ldots, n$. Introducing Lagrange 
multipliers $\alpha_1, \alpha_2, \ldots, \alpha_n$ and using the kernel trick to replace 
inner products, the dual formulation of this optimisation problem becomes 
$\max_{\alpha} \{ \sum_{i=1}^{n} \alpha_i K(x_i, x_i) - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) \}$, 
subject to $\sum_{i=1}^{n} \alpha_i = 1$ and $\alpha_i \ge 0$, $\forall i = 1, 2, \ldots, n$. 
Finding the optimal values, denoted by $\alpha_1^*, \alpha_2^*, \ldots, \alpha_n^*$, is done 
by using a quadratic programming algorithm. 
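
The dual can be illustrated without a quadratic programming solver. The Python sketch below is not the quadratic programming approach referred to above: it swaps in a simple Frank-Wolfe (Badoiu-Clarkson style) iteration, which approximately maximises the same dual over the simplex and needs only the Gram matrix. The function names, the iteration count and the support threshold are illustrative assumptions.

```python
def enclosing_hypersphere(K, n_iter=2000):
    """Approximate smallest enclosing hypersphere in feature space.

    Frank-Wolfe / Badoiu-Clarkson iteration on the dual
        max_alpha  sum_i alpha_i K_ii - sum_ij alpha_i alpha_j K_ij,
        subject to sum_i alpha_i = 1 and alpha_i >= 0.
    Returns (alpha, squared_radius).
    """
    n = len(K)
    alpha = [0.0] * n
    alpha[0] = 1.0                          # start with the centre at the first point
    K_alpha = [K[i][0] for i in range(n)]   # the product K alpha, updated incrementally

    def squared_distances():
        # ||Phi(x_i) - c||^2 with centre c = sum_j alpha_j Phi(x_j)
        cc = sum(a * ka for a, ka in zip(alpha, K_alpha))
        return [K[i][i] - 2.0 * K_alpha[i] + cc for i in range(n)]

    for t in range(1, n_iter + 1):
        d = squared_distances()
        far = max(range(n), key=d.__getitem__)   # point furthest from the centre
        eta = 1.0 / (t + 1)                      # shrinking step size
        alpha = [(1.0 - eta) * a for a in alpha]
        alpha[far] += eta                        # move the centre towards that point
        K_alpha = [(1.0 - eta) * ka + eta * K[i][far] for i, ka in enumerate(K_alpha)]

    return alpha, max(squared_distances())

def support_points(alpha, tol=1e-2):
    """Indices whose multipliers are clearly non-zero (tol is a heuristic)."""
    return [i for i, a in enumerate(alpha) if a > tol]
```

The result is only approximate, so a proper quadratic programming routine remains preferable when exact multipliers are needed; the sketch is meant to show how the centre, the radius and the support points all follow from the multipliers.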

The smallest enclosing hypersphere possesses an important sparseness property 
(cf. Shawe-Taylor & Cristianini, 2004): only the observations lying on the surface 
of the hypersphere, typically a relatively small proportion of the data set, have 
non-zero $\alpha_i^*$ values. These cases are referred to as support points. Our proposal 
is to construct the smallest enclosing hypersphere for each of the two groups, and then 
consider only the support points as potential influential cases, to be evaluated more 
closely on a leave-one-out basis by the three criteria described in Sect. 3. This 
proposal is based on the supposition that an influential point from a given group is 
likely to be a support point for the smallest enclosing hypersphere of that group. 
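
The two-step proposal can be sketched as a filter followed by a restricted leave-one-out search. In the Python sketch below all names are hypothetical; in particular `criterion_without` stands for any routine that refits the classifier without case i and returns the value of one of the criteria of Sect. 3.

```python
def most_influential(candidates, criterion_without, maximise=True):
    """Leave-one-out search restricted to a candidate set.

    candidates        -- indices to examine (e.g. hypersphere support points)
    criterion_without -- callable i -> criterion value with case i omitted
                         (hypothetical; it would refit the classifier)
    """
    values = {i: criterion_without(i) for i in candidates}
    choose = max if maximise else min
    best = choose(values, key=values.get)
    return best, values

def two_step(support_group1, support_group2, criterion_without):
    """Filter by the support points of both groups, then apply the criterion."""
    candidates = sorted(set(support_group1) | set(support_group2))
    return most_influential(candidates, criterion_without)
```

The number of classifier refits equals the number of candidates, which is where the computational saving comes from.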

The Gaussian kernel was used in construction of the hypersphere. It should be 
noted that the hyperparameter $\gamma$ appearing in this kernel function has an influence 
on the number of hypersphere support points, which increases with $\gamma$ (cf. Lamont, 
2008). Empirical evidence suggests that $\gamma = 1/p$ leads to approximately 15-25% 
of the data points being support points. Since this seems to be a reasonable compromise 
between a support set which may be too small (in the sense that it probably 
does not contain the most influential data point) and too big (in the sense that little 
computation time is saved by investigating only the support points), this value was 
used throughout our investigations. It should be noted that fitting the two hyperspheres 
is very fast so that the reduction in the number of points to be investigated 
provides a good indication of the computation time savings. 



5 Monte Carlo Simulation Study 

To evaluate the merit of the proposed criteria, a detailed Monte Carlo simulation 
study was performed. We used a four-factor experimental design in which we varied 
the following factors: 

• The underlying input variable distribution: we used normal and lognormal distri- 
butions. 

• Differences between the two populations: location and spread differences were 
investigated. 

• Correlation between the (equi-correlated) input variables: $\rho = -0.1$, 0 and 0.7 
were investigated. 

• Training sample sizes: $n_1 = n_2 = 100$ and $n_1 = n_2 = 200$ were considered. 

With respect to the number of input variables, p = 5 was used. Training data 
were generated from the relevant distribution for each of the two groups, and 
contaminated by inserting a mislabelled case into group 1. The KFD classifier was then 
obtained using the full training data set. The hypersphere for each group was then 
calculated, using a Gaussian kernel with $\gamma = 1/p$. Two sets of support points were 
obtained, and only these points were considered as potential influential cases in 
the further analyses. The criteria defined in Sect. 3 were then applied to the support 
points to identify the most influential data case. In each instance, this case was 
removed from the data, and the KFD classifier was obtained using the reduced data 
set. A large ($n_1 = n_2 = 5{,}000$) test data set (without atypical cases) was then 
generated from the same distribution as the training data, and classified using the KFD 
classifier based on all the data cases, as well as each of the KFD classifiers obtained 
using the reduced data sets associated with each criterion. 
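
One way such contaminated training data could be generated is sketched below in Python for the normal case with a location difference. The size of the mean shift (`delta`), the seed, and the contamination scheme (one extra draw from the group 2 distribution that receives the group 1 label) are hypothetical, since those details are not given above; equi-correlated variables are obtained through a Cholesky factor of the correlation matrix.

```python
import math
import random

def equicorrelation_matrix(p, rho):
    """p x p correlation matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]

def cholesky(A):
    """Lower-triangular L with L L' = A (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def draw_group(n, mean, L, rng):
    """n draws from a multivariate normal with the given mean and factor L."""
    p = len(mean)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        rows.append([mean[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)])
    return rows

def contaminated_training_set(n1, n2, p, rho, delta, seed=0):
    """Two normal groups differing in location, plus one mislabelled case.

    delta (the mean shift of group 2) and the contamination scheme
    are illustrative assumptions, not taken from the study.
    """
    rng = random.Random(seed)
    L = cholesky(equicorrelation_matrix(p, rho))
    x1 = draw_group(n1, [0.0] * p, L, rng)
    x2 = draw_group(n2, [delta] * p, L, rng)
    mislabelled = draw_group(1, [delta] * p, L, rng)[0]
    X = x1 + [mislabelled] + x2
    y = [-1] * (n1 + 1) + [1] * n2     # the mislabelled case carries the group 1 label
    return X, y, n1                     # n1 is the index of the mislabelled case
```

With p = 5 the equi-correlation matrix is positive definite for every correlation above -0.25, so the value -0.1 used in the study poses no problem for the factorisation.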

Table 1 contains the results obtained for normal data, while the lognormal results 
appear in Table 2. The percentage decrease in the estimated error rate if the case 
identified by each criterion is omitted prior to obtaining the KFD classifier is 
reported. In each case, the error rate after omitting the deliberately inserted 
mislabelled data point was also estimated, and the resulting decrease in error rate is 
reported in the last column of the tables. 

Table 1 Percentage decrease in test error: normal data 

                        Sample size   $\rho$    $v^{(i)}$   $m^{(i)}$   $r^{(i)}$   Mislabelled case omitted
Location difference     100           -0.1     -1.295      -1.295      -1.295      -1.295
                        100            0        0.296       0.296       0.292       0.266
                        100            0.7     -0.021      -0.021      -0.236       0.043
                        200           -0.1      0.188       0.188       0.188       0.188
                        200            0        0.098       0.091       0.098       0.117
                        200            0.7     -0.004       0.000      -0.008       0.004
Dispersion difference   100           -0.1      3.497       2.055       4.565       4.550
                        100            0        3.842       2.552       4.967       4.946
                        100            0.7      2.557       1.907       3.029       3.931
                        200           -0.1      1.418       1.263       1.568       1.558
                        200            0        1.506       1.366       1.662       1.644
                        200            0.7      1.172       0.909       1.301       1.820

Table 2 Percentage decrease in test error: lognormal data 

                        Sample size   $\rho$    $v^{(i)}$   $m^{(i)}$   $r^{(i)}$   Mislabelled case omitted
Location difference     100           -0.1      2.657       2.308       2.606       4.853
                        100            0        0.687       0.627       1.473       2.645
                        100            0.7     -0.14       -0.101       0.926       0.599
                        200           -0.1      0.674       0.711       0.733       1.882
                        200            0       -0.310      -0.307      -0.619       0.620
                        200            0.7      0.558       0.562       0.245       0.128
Dispersion difference   100           -0.1      4.419       4.461       2.035       7.430
                        100            0        2.985       3.077       2.453       3.853
                        100            0.7      5.045       4.955       5.332       1.569
                        200           -0.1      4.506       4.558       2.614       3.185
                        200            0        3.110       3.283       2.596       1.703
                        200            0.7      3.479       3.467       3.679       0.904

The entries in the tables may be interpreted from two perspectives. Firstly, if 
we ignore the final column, the entries in the remaining three columns reveal the 
extent to which the different criteria, applied after using the hypersphere as a filter, 
succeed in identifying a case which has a detrimental influence on the error rate. 
Since the majority of the entries are positive, and the negative entries are generally 
close to zero, we conclude that the criteria are largely successful in this regard. 
This is especially true for cases where the two populations differed with respect to 
dispersion. As is to be expected, the decrease in error rate is generally larger when 
$n_1 = n_2 = 100$ than for the corresponding configuration with $n_1 = n_2 = 200$. 

A second perspective on the results is obtained by comparing the entries for the 
criteria to those in the last column. In some configurations (some of) the criteria 
achieve the reduction in error rate resulting from consistently omitting the atypical 
point. Of particular interest is the fact that in some configurations (some of) the 
criteria achieve a larger reduction than that appearing in the last column. This 
indicates that in these configurations there are often cases present in the training data 
which have an even more detrimental effect on the error rate than the (deliberately 
inserted) mislabelled point. 



6 Application to a Data Set 

In addition to the simulation study, the proposed procedure was also applied to several 
data sets. We report the results obtained on the Swiss bank note data (cf. Flury 
& Riedwyl, 1988), in which six variables, including the length, width and diagonal 
length, were measured on 100 genuine and 100 forged 1,000 Swiss franc notes. The 
smallest enclosing hypersphere was obtained for each of the two groups, yielding 29 
support points for group 1 and 31 for group 2. A Gaussian kernel with $\gamma = 0.2$ was 
used to calculate the hyperspheres. The three criteria were then applied only to the 
support points. All the criteria identify case 70 as the most influential case. The 
cross-validation error rate (CVE) of the KFD classifier obtained on the full data set 
is 0.01350, which drops to 0.00756 after omission of case 70. The CVEs after omission 
of each of the 200 cases in turn were also calculated to establish whether case 70 is 
indeed the case whose omission decreases the error rate the most. In 198 instances the 
error rate is higher (ranging from 0.00902 to 0.01371) than that obtained after 
omission of case 70. However, after omission of case 1, a lower error rate (0.00586) 
is obtained. We then considered the data set without case 70, and repeated the 
two-step procedure. All the criteria achieved their optimal value upon deletion of 
case 1, indicating that if the procedure was carried out in a sequential way, case 70 
would be identified first, and then case 1. The CVE after removal of both these cases 
was 0.00140. 

The procedure was also repeated without using the hypersphere as a filter, implying 
that all 200 data cases were evaluated by each of the criteria. Exactly the same 
results were obtained (case 70 was identified as the most influential case, followed 
by case 1), indicating that using the hypersphere as a filter for this data set 
reduced the computations by approximately 70% (only 30% of the cases were 
support points), while achieving the same reduction in error rate. 
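
The arithmetic behind these figures can be checked in a few lines of Python. The sketch uses only the numbers quoted above; expressing the drop in CVE as a decrease relative to the full-data value is an assumption made here for illustration, and the names are hypothetical.

```python
def percent(part, whole):
    """Part of a whole, as a percentage."""
    return 100.0 * part / whole

def relative_decrease(before, after):
    """Percentage decrease of `after` relative to `before` (assumed definition)."""
    return 100.0 * (before - after) / before

support_share = percent(29 + 31, 200)                 # share of cases that are support points
drop_case_70 = relative_decrease(0.01350, 0.00756)    # CVE after omitting case 70
drop_both = relative_decrease(0.01350, 0.00140)       # CVE after omitting cases 70 and 1
```

The support points make up 30% of the 200 cases, which is the source of the roughly 70% saving in leave-one-out evaluations.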



7 Conclusions and Open Problems 

It is clear from the results presented in Louw et al. (2008) that a single data point in 
a data set can potentially have a substantial influence on the error rate of the KFDA 
classifier calculated from the data set. This is not only true for mislabelled cases: in 
Sect. 5 we saw for some of the configurations which were studied that, even in the 
presence of a mislabelled case deliberately inserted into the data, there may often be 
other data cases which affect the error rate of the KFDA classifier even more 
detrimentally than the mislabelled case. Clearly it is a worthwhile objective to 
develop criteria which may be used to identify such influential data cases (cf. Louw 
et al., 2008). However, since these criteria are calculated on a leave-one-out basis, 
computational issues arise when dealing with large data sets. In this paper we find 
that our proposal to use the smallest enclosing hypersphere as a filter successfully 
reduces the computational burden, while still resulting in lowered error rates. 

There are several avenues for further research. Probably the most obvious issue 
requiring attention is identifying groups of influential cases rather than only a single 
point at a time. In this context the issue of masking should also receive attention. 
Another problem deserving attention is the development of critical values for the 
different criteria which could be used to decide whether the data case identified 
as being most influential in a data set should indeed be considered atypical. 
Furthermore, it would be a definite contribution if, in addition to the simulation-based 
evidence presented in this paper, one could derive theoretical results regarding 
the potential influence of data points on the classification efficiency of KFDA 
(cf. Croux, Filzmoser, & Joossens, 2008 for such results in the case of robust LDA). 
Self-Organising Maps for Image Segmentation 



Ron Wehrens 



Abstract Self-organising maps (SOMs) have been applied in many different areas 
of science. In a typical application, large numbers of objects (thousands or more) are 
mapped to a two-dimensional grid of units in such a way that very similar objects 
end up in the same unit, and that neighbouring units are more similar than far-away 
units. The similarities of the individual units can be used in visualisation of the data 
by choosing appropriate colour schemes. Examples from image segmentation will 
show the usefulness of this approach. 

Often, additional information is available, e.g., class information, or measure- 
ments of a different nature. To take this extra information into account, we have 
extended the basic principle of SOMs to accommodate extra layers, one for each 
data modality. The closest unit is then given by a weighted sum of per-layer 
distances. The result is an overall better mapping, incorporating all available infor- 
mation. This is implemented in an R package “kohonen”. 

Keywords Data fusion • Self-organising maps • Supervised mapping • Visualisation. 



1 Introduction 

Self-organising maps (SOMs, Kohonen, 2001) have found application in many dif- 
ferent fields of science. Their principal use is in projecting large multivariate data 
sets to a two-dimensional grid of units, each characterised by so-called “codebook 
vectors”. After the map has been trained (see below), these codebook vectors play 
the role of archetypical objects, and the complete set of codebook vectors in a sense 
covers the space of the data set. The projection of the data onto the map, known 
as “topographic mapping” in the SOM community, is achieved by assigning every 
object to the unit whose codebook vector is most similar. Such a map makes it easy 
to inspect the data for relations between objects: similar ones will be mapped 
close together, or even in the same unit. The technique is especially useful with large 
numbers of objects, since it is not necessary to calculate all inter-object distances. 
Rather, the distances between the objects and the codebook vectors are calculated, 
which typically reduces the number of calculations by several orders of magnitude. 

This paper shows improvements in the use of SOMs in the field of multivariate 
image segmentation. Pixels, each consisting of values for several spectral variables, 
are mapped to a SOM. By presenting the pixels in the original image using a sepa- 
rate colour for every unit, a segmented image is obtained. The first contribution of 
this paper shows smooth colouring schemes immediately conveying the similarity 
of different segments in the image. This is illustrated using MRI images of brain 
tumour patients. The second aim of this paper is to illustrate the potential of 
supervised mapping, in which additional class knowledge is utilised (Melssen, Wehrens, 
& Buydens, 2006). 

In the next section, theory, data and software will be presented. The paper pro- 
ceeds with two sections on SOMs for image segmentation, covering methods for 
improving interpretability and supervised mapping, respectively. It concludes with 
a discussion highlighting further applications and possibilities. 



2 Materials and Methods 
2.1 Theory 

The theory of SOMs has been described in numerous books and papers and will 
be reviewed here only very briefly. Basically, the training algorithm is that of a 
k-means clustering, with an added spatial smoothness constraint (Ripley, 1996). 
In this context, the codebook vectors play the role of the cluster centers. They are 
initialised randomly. In each iteration (i.e., a mapping event of one object in the data 
set), the unit whose codebook vector is most similar to the new object, the “winning 
unit”, and its neighbours are updated as weighted averages of the old codebook 
vectors and the newly mapped object. Both the weight of the new object, and the size 
of the neighbourhood are decreased during training, so that in the latter phases only 
the winning units are updated, with very small refinements. Objects are presented in 
random order, typically several hundreds of times. Although this training procedure 
can be time consuming, especially for larger data sets, subsequent projection of 
new objects is very fast. For most practical applications (say, up to $10^6$ objects) the 
current approach is entirely feasible on a simple desktop computer. 
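
The training loop described above can be sketched in plain Python. This is an illustrative toy, not the implementation in the kohonen package: the rectangular grid, the linear decay of the learning rate and the neighbourhood radius, the Chebyshev neighbourhood on the grid, the initialisation from randomly chosen objects, and all parameter names are assumptions made for the example.

```python
import random

def winning_unit(codebook, x):
    """Index of the unit whose codebook vector is closest to object x."""
    def sqdist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(codebook)), key=lambda u: sqdist(codebook[u]))

def train_som(data, n_rows, n_cols, n_epochs=100,
              alpha_start=0.05, alpha_end=0.01, seed=0):
    """Online SOM training; returns the grid positions and the codebook vectors."""
    rng = random.Random(seed)
    units = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    # random initialisation: each codebook vector starts at a randomly chosen object
    codebook = [list(rng.choice(data)) for _ in units]
    radius_start = max(n_rows, n_cols) / 2.0
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        order = list(range(len(data)))
        rng.shuffle(order)                      # objects are presented in random order
        for idx in order:
            frac = step / n_steps
            alpha = alpha_start + (alpha_end - alpha_start) * frac   # weight of the new object
            radius = radius_start * (1.0 - frac)                     # shrinking neighbourhood
            x = data[idx]
            wr, wc = units[winning_unit(codebook, x)]
            for u, (r, c) in enumerate(units):
                if max(abs(r - wr), abs(c - wc)) <= radius:
                    # weighted average of the old codebook vector and the new object
                    codebook[u] = [(1.0 - alpha) * w + alpha * xi
                                   for w, xi in zip(codebook[u], x)]
            step += 1
    return units, codebook
```

Once the radius drops below one grid step only the winning unit is updated, which corresponds to the small refinements in the later phases of training.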

Supervised mapping, e.g., taking into account class information of individual 
objects, has obvious advantages. It can produce a better mapping because more 
information is being used, and as a consequence the set of codebook vectors 
provides a more accurate coverage of the experimental space. The first attempts 
at supervised mapping just added class information as extra variables (Kohonen, 
2001). For relatively simple problems with low numbers of variables this works 
reasonably well, but the approach leads to problems when the numbers of variables 
in the original data and the additional information differ substantially. As a result, 
the distances between objects and codebook vectors are completely dominated by 
one of the two data domains. Earlier, we have presented an approach (Melssen et 
al., 2006) to combine the two data domains explicitly by using a combined distance 
measure: 

$$D(o, u) = \alpha D_1(o, u) + (1 - \alpha) D_2(o, u).$$

The overall distance of object o to the codebook vector of unit u is a weighted 
average of the distances in the two data domains, $D_1$ and $D_2$. The latter are 
calculated separately, and are scaled in such a way that the maximal element of both 
$D_1$ and $D_2$ equals 1. This takes away the effect of different scales in the two 
domains. This method, termed “XYF” (for X-Y fused maps), can provide an intrinsically 
better mapping, in the sense that more information is represented. This has been 
illustrated, e.g., in the area of crystal structure research (Willighagen, Wehrens, 
Melssen, de Gelder, & Buydens, 2007). Finally, one can generalise the approach to any 
number of different data layers: 

$$D(o, u) = \sum_i \alpha_i D_i(o, u),$$

where each layer, representing a different view on the data, is given a separate 
weight $\alpha_i$ (Wehrens & Buydens, 2007). This opens the way to using all available 
information in a simple, intuitive manner. 



2.2 Data 

Two data sets will be used to illustrate the basic concepts. The first is an MRI 
data set of a patient with a large brain tumour, measuring 256 by 256 pixels. 
Background pixels have been removed to yield an image of 36,294 pixels. 
Every pixel is described by four variables, the four different MRI images available. 
These correspond to a proton density image, a T1- and a T2-weighted image, and 
a gadolinium-enhanced image, respectively. A similar data set has been addressed 
earlier in Wehrens, Buydens, Fraley, and Raftery (2004). 

The second data set is a polarimetric SAR image of a part of Flevoland, an 
agricultural part of The Netherlands. The image has 400 by 400 pixels, with 
each pixel consisting of 18 variables (Hoekman & Vissers, 2003; Thanh, Wehrens, 
Hoekman, & Buydens, 2005). After applying a mask to cover roads and mixed areas, 
105,397 pixels, divided over six vegetation types, are present in the data set. This 
is randomly divided into a training set of 10,975 pixels (lying in a rectangular area) 
and a test set of 94,422 pixels, in such a way that all six classes are represented in 
both training and test sets. 

2.3 Software 

All analyses are done in R (R Development Core Team, 2008), using the “kohonen” 
package (Wehrens & Buydens, 2007), both available from the central repository 
CRAN 

http://cran.r-project.org 

Add-on functions for the smooth colour map and triangular SOMs have been written, 
which are available in the current version of the kohonen package (2.0.5). 



3 SOMs for Image Segmentation 

The goal of (multivariate) image segmentation is to group pixels of similar inten- 
sities together. In clinical decision making, for instance, MRI images are routinely 
used to determine the exact location and size of brain tumours, and in some cases 
even the type of tumour. An example is shown in Fig. 1, where four different types 
of MRI data are depicted. It is not easy for the clinician to mentally combine the 
four images and process all gray levels; a segmented image with a limited number 
of colour classes is much simpler to interpret. 

The left plot in Fig. 2 shows the mapping of the individual pixels to a four-by-four 
SOM. The segmented image on the right is obtained by plotting every pixel 
with the colour of its SOM unit: the result is much easier to interpret than the 
four MRI images. The large tumour behind the left eye is clearly visible. Note that 
we used a very small SOM to limit the number of colours in the segmented image; 
an alternative would be to cluster the individual units in the SOM according to their 
similarities (e.g., Tasdemir & Merenyi, 2006), or use Ultsch's U-matrix methodology 
(Ultsch, 1993) and variants thereof. What is not so clear, however, is the relation 
between the different tissue types. In this case, the colouring is a rainbow palette 
from the first unit in the bottom left corner to the last, in the top-right corner, in 
row-wise fashion. Using a colouring scheme for the units that is spatially smooth in 
all directions would lead to a much more obvious interpretation of the colours in the 
segmented image. 

Fig. 1 Four types of MRI data: from left to right a proton-density image, a T1-weighted 
image, a T2-weighted image and a gadolinium-enhanced image. Dark gray indicates low values; 
background outside the skull has been removed 

Fig. 2 Left plot: mapping of pixels to a four-by-four SOM. Individual pixels are indicated with 
black dots at random positions within the units. Right plot: segmented image, by colouring each 
pixel with the colour of the corresponding SOM unit 

To illustrate the concept, consider a triangular SOM, rather than the usual rect- 
angular shape, as shown in Fig. 3. By assigning the three basic colours red, green 
and blue to each of the three corners, one can obtain a continuous colour gradient in 
the map. Computationally, this is achieved by using the relative row numbers (val- 
ues between 0 and 1, inclusive) as intensities for one of the colours, and repeating 
the operation twice after rotating the map by $2\pi/3$ and $4\pi/3$, respectively, for the 
other two colours. Using other values for the rotation angles leads to other colour 
schemes, which all share the spatial smoothness that makes the plots easy to inter- 
pret. Two examples, depicting the same segmentation as the right plot in Fig. 2, are 
shown in Fig. 4. The tumour is very clearly visible, for the most part because it is 
mapped to one of the vertices of the triangle. 
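
The colour construction can be sketched in Python as follows. The layout of the triangular map (bottom row widest, unit spacing on an equilateral lattice), the rotation about the centroid of the units, and the assignment of angles to colour channels are assumptions made for illustration; this is not the code of the kohonen package.

```python
import math

def triangular_units(n_rows):
    """Unit centres of a triangular map: the bottom row has n_rows units, the top row one.

    n_rows must be at least 2 so that the heights can be rescaled.
    """
    coords = []
    for r in range(n_rows):
        for c in range(n_rows - r):
            coords.append((c + 0.5 * r, r * math.sqrt(3.0) / 2.0))
    return coords

def relative_heights(coords, angle):
    """Rotate the map about its centroid and rescale the heights to [0, 1]."""
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    heights = [math.sin(angle) * (x - cx) + math.cos(angle) * (y - cy)
               for x, y in coords]
    lo, hi = min(heights), max(heights)
    return [min(1.0, max(0.0, (h - lo) / (hi - lo))) for h in heights]

def smooth_colours(coords, angles=(0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0)):
    """One (red, green, blue) triple per unit, each channel a rotated row height."""
    channels = [relative_heights(coords, a) for a in angles]
    return list(zip(*channels))
```

With the default angles each corner unit receives one pure colour and the units in between receive smooth mixtures; passing other angles to `smooth_colours` gives alternative colour schemes of the kind described above.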
Fig. 3 Top row, from left to right: individual colour weights for the three corners, obtained from the 
row heights after rotation by $2\pi/3$, 0, and $-2\pi/3$, respectively. Bottom plot: resulting 
RGB-coloured map 



Fig. 4 Left: standard three-colour plot with red, green and blue as the most extreme values. Right 
plot: custom plot using $\pi/6$ and $-\pi/6$ as rotation angles in obtaining the extreme colours. The 
segmentation in these plots is identical to the one in Fig. 2 



One of the well-known disadvantages of SOMs is the variability in mapping 
introduced by the random initialisation. Repeated training may lead to very different 
results, which may be of comparable quality in the case of multiple local optima. 
The proposed colour scheme provides a way to normalise the colouring to some 
extent. One can rotate and mirror the map in such a way that the least populated 
corner is at the top, and the most populated corner of the map is in the lower left. 
In Fig. 5, the top row shows five replicated mappings starting from different random 
seeds. Although the tumour is visible in all five segmented images, it is hard to 
see whether all images in essence are similar, or whether differences exist. The 
bottom row shows the ordered version, arranged in such a way that the white corner 
contains fewest mapped objects, and the dark blue corner contains most. Clearly, 
the mappings are very similar, which is much clearer from the bottom row of figures 
than from the top row. 

Fig. 5 Top row: five repeated mappings. Since cluster colouring is dependent on the initialisation, 
it is hard to compare the results. Bottom row: colouring ordered according to population of the 
extreme units. White: least populated unit; dark blue: most populated unit 



4 Supervised SOMs 

Very often, SOMs are used in property prediction, such as class membership. 
All objects mapped to the same unit are expected to have the same class, usu- 
ally determined by majority voting of the class of the training objects. Obviously, 
incorporating class information in training the map will lead to codebook vectors 
reflecting more knowledge, and a more realistic mapping. The eventual prediction 
will be done by projecting new data on the trained map, without taking into account 
class information, which may not even be available. The class associated with the 
winning unit, based on the class of the objects in the training set, will then be 
assigned to a newly mapped object. 
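
Both steps can be sketched compactly in Python: choosing the winning unit during supervised training from the weighted sum of two per-layer distances, and predicting the class of a new object from its winning unit alone, using majority voting over the training objects mapped to each unit. The names are hypothetical, the per-layer distances are assumed to have been rescaled already, and this is not the code of the kohonen package.

```python
from collections import Counter

def combined_distance(d1, d2, alpha):
    """Weighted average alpha * d1 + (1 - alpha) * d2 for one object and one unit."""
    return alpha * d1 + (1.0 - alpha) * d2

def xyf_winner(dist1, dist2, alpha):
    """Winning unit given the per-unit distances of one object in the two layers."""
    scores = [combined_distance(a, b, alpha) for a, b in zip(dist1, dist2)]
    return min(range(len(scores)), key=scores.__getitem__)

def unit_classes(winners, labels):
    """Majority class of the training objects mapped to each unit."""
    votes = {}
    for unit, label in zip(winners, labels):
        votes.setdefault(unit, Counter())[label] += 1
    return {unit: counts.most_common(1)[0][0] for unit, counts in votes.items()}

def predict(winner, classes_by_unit, default=None):
    """Class of a new object, given the unit it is mapped to without class information."""
    return classes_by_unit.get(winner, default)
```

Units that attract no training objects have no class of their own, which is why `predict` takes a default value.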

To show the potential of supervised mapping, Fig. 6 shows the ground truth of the 
Flevoland data set, the division in training and test sets, and the results of applying 
supervised and unsupervised mapping, respectively. Both supervised and unsuper- 
vised mapping use a 20-by-10 grid of hexagonally oriented units. Ten repeated 
mappings have been performed, starting from different initializations; the results 
shown here are the overall best results for both the unsupervised and supervised 
mappings. 

Boxplots of the prediction results for ten repeated mappings, depending on the 
parameter $\alpha$, are shown in Fig. 7. The value $\alpha = 0$ equals unsupervised mapping, 
i.e., SOM. Clearly, the supervised SOM version is insensitive to the exact value of 
$\alpha$: all non-zero values lead to significantly better predictions. Especially the Barley 
class profits from the supervised mapping: rather than the meagre 68% correct for 
the (optimal) unsupervised case, the overall best XYF leads to more than 81% correct 
classifications. The improvement is most visible in the large barley patch that is 
slightly left above the center of the picture. 

Fig. 6 Starting from the top left in clockwise order: the ground truth of the Flevoland data set; 
the division in training and test sets; class prediction of the test set based on supervised mapping 
(“XYF”) and on unsupervised mapping (“SOM”) 

The class predictions for individual units in the maps are depicted in Fig. 8. 
Although both in the supervised and unsupervised case the classifications are 
relatively clear-cut, the supervised mapping shows fewer “mixed” units; moreover, the 
Barley class is represented as one contiguous area within the map, where in the 
unsupervised case the Barley class is split in two by the Winter wheat class. The 
spectral features of these two classes are very similar; adding the class information 
enables the map to make a better estimate of the true class means and to generate 
better predictions. 

Fig. 7 Boxplots of percentages of correct prediction for various values of $\alpha$: $\alpha = 0.0$ 
corresponds to the regular, unsupervised SOM. Clearly, including class information in the mapping 
leads to significantly higher prediction rates 

Fig. 8 Segment plots for the crop types associated with individual units in the map. Unsupervised 
mapping (SOM) is shown on the left, supervised mapping (XYF) on the right. Colours match those in 
Fig. 6. Crop types in the legend: Sugar beet, Rapeseed, Peas, Winter wheat, Potato, Barley 


5 Discussion 

In multivariate image segmentation, the aim is to summarise information of several 
(sometimes many) images into one colour-coded image, where each colour 
corresponds to a distinct class. In most cases, the number of classes is unknown. 
One usually chooses to err on the safe side and to take too many classes. In the 
final interpretation of the figure, several classes then have to be merged. By using a 
colour scheme that reflects similarities between different classes, it is much easier to 
interpret the result and to see which classes are similar. This has been demonstrated 
in this paper using triangular-shaped SOMs for mapping multivariate MRI images. 
A subsequent improvement is the potential to rotate the maps in order to get similar 
colour-codings in repeated training runs. Since SOMs are initialised randomly, 
the result may look dramatically different but in reality be quite similar. A simple 
ordering of colours according to the population of objects in corner units leads to a 
much more consistent view. 

SOMs have been used in multivariate image segmentation before; Li and Chi 
(2005), e.g., apply SOMs to approximately the same types of MRI data that are 
used here, but use a Markov Random Field model to describe spatial constraints. In 
contrast, our approach uses only the measured data, similar to the approach taken in 
clustering papers such as Fraley, Raftery, and Wehrens (2005). An example of the 
use of SOMs in analysing multivariate remote sensing data can be found in Villmann 
and Merenyi (2001). 

Supervised mapping, taking into account class information, is shown to lead to 
an improved mapping, in the sense that different classes are more compactly rep- 
resented in the map; moreover, prediction rates are consistently higher than with 
regular SOMs. Another example, showing this even more dramatically, is given in 
Willighagen et al. (2007). Individual unit classifications are less ambiguous, and as 
a result, class predictions improve as well. 

Of course, one does not have to stop at adding one extra layer of information, 
in this case class information: one can extend the principle to N maps, to allow for 
more, complementary, information sources. In Wehrens and Buydens (2007), this is 
called “Super-organised maps”, and the example is given of a yeast gene set, where 
genes have been synchronised in a number of different ways. Mapping genes using 
four of these arrestation methods was shown to lead to much better classification 
results. For constructing an even better classifier one could additionally include class 
information. 

The main goal of the combination of several data entities, however, is to improve 
the mapping. If this is achieved, even when hard classification success rates do not 
improve greatly, major progress is made: SOMs and relatives should not be seen as 
an off-the-shelf classification tool but rather as a way to obtain insight into the data,
something that is increasingly important.

Abstract Next generation postal sorting machines reuse once extracted mail piece
addresses in different sorting steps by means of the mail piece image. Based on the 
mail piece uniqueness, characteristics derived from the image guarantee the assign- 
ment of stored addresses. During the first sorting step mail piece characteristics are 
extracted and stored together with the target address in a database. In subsequent 
sorting steps the address is accessed by determining the corresponding mail piece 
characteristics in the database. Appropriate mail piece image characteristics and 
procedures for their distance measurement were presented in a previous work. 

Image based mail piece identification poses a challenge by a constantly changing 
and non-deterministic mail spectrum and the differentiation of nearly identical bulk 
mail. In particular, the rejection of unknown mail pieces requires the definition of 
carefully chosen rejection classes depending on the current mail spectrum. 

In this paper we present an approach for distance based mail piece identifica- 
tion using a two-stage classification process. Bulk and private mail are handled 
individually by an unsupervised learning process which clusters similar mail piece 
characteristics. Based on these clusters specific rejection classes can be estimated 
within each cluster. The first step in the identification process is the determination of 
the corresponding cluster for a given mail piece. Using the cluster specific rejection 
classes a mail piece is either identified or rejected. Experimental results obtained on 
real-world data sets prove the applicability of the proposed method. 

Keywords Document identification · Unsupervised learning · Adaptive rejection
criterion.



1 Introduction 



In postal automation mail pieces are mostly automatically processed. While passing 
different sorting machines, mail pieces are sorted according to the delivery route. 
Each single sorting run requires the recipient address information captured either 
automatically or manually from the mail piece image. In particular, manual address
reading is expensive. In order to reduce these costs significantly, the address
information is extracted only once in the first sorting run, stored in a database and reused
in later sorting runs. Currently, the stored address information is associated with 
a unique key to access a specific mail piece entry. For mail piece assignment this 
key is printed as barcode on the mail piece surface. If the stored address is needed, 
the barcode imprint is read to load the corresponding database item. Undesired sur- 
face modifications like labels and the barcode imprint are a significant drawback 
of this method. Furthermore, additional costs are caused by hardware like the bar- 
code printer and the label applicator as well as running costs for ink and labels on 
poly-wrapped mail pieces. Image based mail piece identification overcomes these 
drawbacks. Due to the unique surface of a mail piece, image characteristics are 
used as unique mail piece key. Thus, the stored address information is accessed by 
identifying the associated stored mail piece characteristic set. 

In Worm and Meffert (2008) we have presented a procedure for image based 
document comparison focusing on mail piece comparison. Characteristics derived 
from document text regions and their mutual relations are used as document fea- 
ture set. Based on the calculated feature set, distances to stored document feature 
sets are determined and the final distance ranking is evaluated. In this paper this 
approach is extended to document identification and in particular to mail piece iden- 
tification based on the calculated distances. Errors in mail piece identification cause 
sorting errors which increase the delivery costs. In order to prevent identification 
errors and to reject unknown mail pieces, a two-staged identification process using 
unsupervised learning is introduced. 

This paper is organized as follows. Section 2 briefly states the application
requirements, followed by an overview of the proposed approach in Sect. 3. Mail
piece describing features are outlined in Sect. 4. The mail stream analysis and the
calculation of its characteristics are described in Sect. 5. After that, Sect. 6 gives an
outline of the identification process itself. Finally, experimental results are presented
in Sect. 7 and a conclusion and outlook are given in Sect. 8.



2 Motivation 

Mail piece identification corresponds to image based document identification tasks 
like document duplicate detection (Doermann, Li, & Kia, 2003), document retrieval 
(Hu, Kashi, & Wilfong, 1999) or document image matching for revision detection 
(van Beusekom, Shafait, & Breuel, 2007). However, the requirements for image 
based mail piece identification in postal sorting machines are different in terms of 
classification, error prevention and the feasible runtime. 

In order to identify documents based on their image characteristics different
solutions are already known. In Hu et al. (1999) and Peng, Long, and Chi (2003)
a classifier trained in a supervised learning process is used for document
classification. The need for a labeled training data set rules this method out for
the proposed application, because the composition of the mail stream is unknown and
continuously changing. The minimum distance classifier used in Peng, Long, Siu, Chi,
and Feng (2000) for document identification within a database constitutes a more
adequate solution for the proposed application. Moreover, the prevention of
identification errors as well as the rejection of unknown mail pieces requires the
definition of a rejection criterion. The discontinuous mail stream containing private
and bulk mail poses a challenge for the definition of such a criterion. Generally,
private mail differs in its rough layout, whereas bulk mail of one type differs in its
address only (Fig. 1). Thus, characteristics of private mail typically have large
distances to each other, while bulk mail characteristics of one type present marginal
differences only. Since the type of a mail piece is not known in advance, a global
rejection criterion would be either too coarse for bulk mail rejection or too strict
for private mail rejection. Therefore, rejection criteria have to be calculated
dynamically based on the current mail stream composition.

Fig. 1 Different examples of private mail and bulk mail where bulk mail of one type differs in its
address only

Finally, the mail sorting process is subject to severe runtime restrictions. The 
available processing time is less than one second per mail piece to guarantee fast 
mail piece sorting. Limitation of the mail piece search area should ease mail piece 
differentiation and lead to a runtime optimization. 

Considering these facts, an adaptive approach is needed which effectively reduces
the search area for a given mail piece. Moreover, the proposed approach emphasizes
a reliable final mail piece identification result while reducing identification
errors.



3 Approach 

Image based mail piece identification is subdivided into two processes: mail piece
registration and mail piece identification. Mail piece registration corresponds to
the first sorting run (Fig. 2), which captures image characteristics and the recipient
address from the mail piece scan. Both datasets are stored together in a database for
later mail piece identification. In further sorting steps the mail piece address is
determined from the database. Based on a new mail piece scan, image characteristics
are calculated again and compared to the stored ones. If the corresponding mail
piece data set is determined, the included address information can be used for mail
piece sorting.

Fig. 2 System overview on image based mail piece identification including mail piece registration,
search area consolidation and mail piece identification

In order to optimize mail piece identification and determine mail stream adap- 
tive rejection criteria, the current mail mix has to be analysed before mail piece 
identification. For this purpose, the additional Search Area Consolidation process 
is introduced in the proposed approach (Fig. 2). Located between mail piece regis- 
tration and identification, this process examines a given search area with respect to 
the occurrence and distribution of different bulk mail types as well as private mail. 
Furthermore, dedicated rejection criteria are derived for the given search area which 
are applied in the subsequent mail piece identification process. 

Based on mail piece layout and detailed characteristics calculated during mail
piece registration, the Search Area Consolidation process consists of two steps.
Firstly, the appearing mail piece types (private mail or different types of bulk
mail) are identified within the given search area. Similar to redundancy detection
in document databases (Foo, Zobel, & Sinha, 2007) or document categorization by
similar layouts (Hu et al., 1999), an unsupervised learning process is employed to
analyse and group the unknown, steadily changing mail piece types. Since mail
pieces of one type resemble each other in their spatial layout, bulk mail of one type
represents a cluster in the feature space. Thus, the search area can be organized in
clusters of similar mail pieces using spatial layout describing features and a
clustering algorithm. This procedure simplifies and accelerates the subsequent
mail piece identification process. On the one hand, the time-consuming distance
calculation of detailed mail piece characteristics used to make the final identification
decision is limited to the mail pieces included in the associated cluster. On the other
hand, this preselection excludes ineligible mail piece candidates and reduces potential
identification errors. In the second consolidation step the cluster specific rejection
criteria are derived from detailed mail piece characteristics. For this purpose, the
expected mail piece distances are analysed by statistical methods to estimate rejection
criteria for the included mail pieces.

Using the current mail mix characteristics determined in the Search Area
Consolidation process, mail pieces can be identified in the identification process in
two steps. As in mail piece registration, the mail piece characteristics are extracted
from a new image scan. At first, the corresponding mail piece cluster has to be
determined using spatial layout describing features. After that, the best fitting mail
piece candidate is identified within the cluster by means of detailed mail piece
describing features. Depending on the specific rejection criterion the candidate is
either accepted or rejected.



4 Feature Extraction and Comparison 

Mail piece image characteristics aim at a compact mail piece representation describing
any mail piece type. In the proposed approach two kinds of features are introduced
for image based mail piece identification. Mail piece layout features $f_r$
roughly separate the appearing mail piece types, while detailed local features $f_d$
facilitate final mail piece identification or rejection:

Rough mail piece layout features ($f_r$). For feature extraction the mail piece grey
level image $I_g$ is subdivided into $n$ equally sized blocks. The standard deviation
$\sigma_i$ of the occurring grey values is calculated for each block $i$ and arranged into a
mail piece fixed length feature vector $f_r$:

    $f_r = (\sigma_1, \sigma_2, \ldots, \sigma_n)$.    (1)

Finally, the similarity of two feature vectors $f_{r_1}$ and $f_{r_2}$ is calculated efficiently
by the Manhattan metric $\delta_m$. These statistical features derived from the mail
piece image guarantee a rough, global mail piece layout description and meet the
runtime requirements in the application processes.

Detailed local mail piece features ($f_d$). The final mail piece candidate is identified
by means of detailed features focusing on the occurring mail piece text
regions. Text describing features and their spatial relations are represented in
an attributed relational graph and compared by a complex metric $\delta_d$ as described
in detail in Worm and Meffert (2008).

So, each mail piece is represented by two characteristic sets $f_r$ and $f_d$ captured in the
registration as well as the identification process.
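
The rough layout features can be illustrated with a minimal Python sketch. It computes one grey-value standard deviation per image block, as in (1), and compares two feature vectors with the Manhattan metric. The function names, the square block grid and the use of the population standard deviation are assumptions made for this sketch only; the detailed graph-based metric of Worm and Meffert (2008) is not reproduced here.

```python
from statistics import pstdev


def layout_features(image, blocks_per_side):
    """Return block-wise grey-value standard deviations of a grey-level image.

    `image` is a list of equally long rows of grey values. The image is cut
    into blocks_per_side x blocks_per_side blocks (hypothetical grid layout),
    so the feature vector has n = blocks_per_side ** 2 entries.
    """
    height, width = len(image), len(image[0])
    block_h = height // blocks_per_side
    block_w = width // blocks_per_side
    features = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            values = [
                image[y][x]
                for y in range(by * block_h, (by + 1) * block_h)
                for x in range(bx * block_w, (bx + 1) * block_w)
            ]
            # Population standard deviation of the block (assumption).
            features.append(pstdev(values))
    return features


def manhattan(f1, f2):
    """Manhattan (city block) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(f1, f2))
```

Because the feature vector has a fixed length, comparing two mail pieces costs only n subtractions and additions, which is in line with the runtime argument made above.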



5 Search Area Consolidation 
5.1 Mail Stream Analysis 

Since bulk mail of one type has a similar layout, it represents a cluster in the feature 
space. Unsupervised learning methods like clustering aim at automatic grouping of 
such unlabeled objects with similar characteristics. For this reason, they are used 
in the proposed approach to facilitate a different handling of the appearing mail 
piece types in the subsequent identification process. However, unsupervised learning 
applied to mail piece clustering in a postal sorting process has to comply with the 
following requirements: 

Independence of Domain Knowledge. The unknown contents of a given mail
piece search area prevent the usage of parameters which specify the number
of expected clusters directly or indirectly by a threshold. Furthermore, human
verification of cluster results has to be avoided due to the large amounts of mail
processed in a typical postal sorting center every day.

Independence of the Mail Piece Sequence. In order to guarantee a robust cluster- 
ing result, the algorithm has to be independent of the mail piece order within a 
given search area. 

Accuracy. Minimizing the risk of mail piece misidentification or false rejections 
requires a high accuracy of the clustering result. Mail pieces which are not 
assigned to their corresponding cluster will be either rejected or might even cause 
a sorting error in the identification process. 

Cluster Results. The identification process expects a search area divided into dis- 
joint subsets. Thus, each mail piece has to be assigned exactly to one cluster and 
each cluster has to consist of at least one mail piece. 

Runtime. The postal application enforces a strictly limited algorithm runtime. 
Hence, computationally expensive methods which require multiple clustering 
runs are not applicable. 

Considering these prerequisites a hybrid method combining an agglomerative, 
hierarchical and a partitioning cluster algorithm is employed. 

Cluster algorithms require a distance metric for feature comparison and a distance
metric for cluster comparison. In order to detect the appearing mail clusters,
each mail piece is represented by its layout features $f_r$ compared by the Manhattan
metric $\delta_m$ (Sect. 4). Using this metric, in the proposed approach two clusters
$C_a$ and $C_b$ are compared by the robust Ward linkage:

                                                                              (2)

where $f_{r_a}$ and $f_{r_b}$ correspond to the cluster medoids. On this basis, an agglomerative,
hierarchical clustering (Kaufman & Rousseeuw, 1990) is performed at first. This
method is independent of parameters like the expected number of clusters or the
specification of an initial partitioning. Furthermore, the method is invariant towards
the mail piece order. Initially, each mail piece characteristic set $f_r$ is regarded as
one cluster. In each clustering step the two most similar clusters determined by (2) are
merged. So, iteratively a cluster hierarchy is constructed until all mail pieces are
joined into one final cluster.

Based on the linkage hierarchy comprising $M$ merge steps, the optimal cluster
number can be estimated (Milligan & Cooper, 1985). For this purpose, the calculated
merge distances $\delta_{w_i}$ are normalized by their mean $\mu_w$ and their standard
deviation $\sigma_w$:

    $\delta'_{w_i} = \frac{\delta_{w_i} - \mu_w}{\sigma_w}$.    (3)

The number of the appearing mail types within a search area corresponds to the
cluster step $i$ fulfilling the condition $\delta'_{w_i} < \vartheta$, where $\vartheta$ is empirically determined.
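
A minimal sketch of this stopping rule is given below. It normalizes a list of merge distances as in (3) and counts the merges whose normalized distance stays below the threshold. Reading the rule as "stop before the first merge that reaches the threshold" is one possible interpretation, not necessarily the authors' exact procedure; the function names, the population standard deviation and the threshold values are assumptions.

```python
from statistics import mean, pstdev


def normalize_merge_distances(merge_distances):
    """Z-score the merge distances by their mean and standard deviation."""
    mu = mean(merge_distances)
    sigma = pstdev(merge_distances)  # population form (assumption)
    if sigma == 0:
        return [0.0] * len(merge_distances)
    return [(d - mu) / sigma for d in merge_distances]


def estimate_cluster_count(merge_distances, theta):
    """Estimate the number of clusters from an agglomerative hierarchy.

    `merge_distances` holds the M = N - 1 merge distances in merge order.
    Merges are accepted while the normalized distance is below `theta`
    (hypothetical, empirically chosen); each accepted merge removes one
    cluster from the initial N singletons.
    """
    n_items = len(merge_distances) + 1
    accepted = 0
    for value in normalize_merge_distances(merge_distances):
        if value >= theta:
            break
        accepted += 1
    return n_items - accepted
```

For example, seven mail pieces with merge distances 1, 1, 1, 1, 10, 12 and a threshold of 1.0 give three clusters: the four cheap merges are accepted, the two expensive ones are not.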

Finally, for cluster refinement and cluster accuracy optimization the Partitioning
Around Medoids algorithm (Kaufman & Rousseeuw, 1990) is used. Clusters are
readjusted by means of the cluster medoids determined in the previous hierarchical
clustering process. For cluster identification in the subsequent mail piece
identification process, each cluster is represented by its medoid and consists of
equally structured mail pieces, ideally of one type only.
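
The refinement idea can be sketched as follows. Note that this is a simplified alternating k-medoids iteration (assign each piece to its nearest medoid, then re-elect the medoid of each cluster), not the full swap search of Partitioning Around Medoids. The function names and the iteration limit are assumptions; the initial medoids are taken as given, for instance from the preceding hierarchical clustering.

```python
def manhattan(f1, f2):
    """Manhattan distance between two layout feature vectors."""
    return sum(abs(a - b) for a, b in zip(f1, f2))


def refine_clusters(features, medoids, max_iter=20):
    """Refine a partition around medoids.

    `features` is a list of feature vectors, `medoids` a list of indices
    into `features`. Returns a dict mapping medoid index -> member indices.
    """
    medoids = list(medoids)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: every item joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for idx, f in enumerate(features):
            nearest = min(medoids, key=lambda m: manhattan(f, features[m]))
            clusters[nearest].append(idx)
        # Update step: the member with the smallest total distance to all
        # other members becomes the new medoid of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # can only happen for duplicate medoids
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(
                    manhattan(features[c], features[o]) for o in members
                ),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids are stable
        medoids = new_medoids
    return clusters
```

The result is a set of disjoint clusters, each containing at least its own medoid, which matches the requirement on cluster results stated in Sect. 5.1.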



5.2 Rejection Criteria Estimation 

The rejection possibility of a decision is of great importance for image based mail 
piece identification. Identification errors increase the sorting costs by wrong deliv- 
eries while rejected mail pieces are delayed due to further sorting runs. Hence, 
rejection criteria have to be defined in a way that they guarantee an optimal trade-off 
between accepting results and avoiding misidentifications. 

In case of unlabeled training data sets, the definition of an empirical rejection
threshold $\vartheta_r$ is a well known solution; it limits the acceptance of a candidate
distance $\delta$ by $\delta < \vartheta_r$. For the proposed application such a threshold is
unsuitable, because the different mail piece types present a different rejection behaviour.
Depending on the mail piece complexity and the severity of address variations,
the appearing mail piece distances within a cluster can be distributed differently.
Clusters presenting small distance variations require a smaller rejection threshold
than clusters with large distance variations.

In the proposed approach local, adaptive thresholds for each mail piece are estimated
contingent upon the mail piece cluster contents. Assuming a nearly ideal
system, a mail piece related rejection threshold $\vartheta_i$ corresponds to the distance
$\delta$ of its nearest registered neighbour within a cluster. In order to increase the
robustness towards image deviations between the registration and the identification
process, thresholds are adapted depending on the standard deviation $\sigma$ of distances
within a cluster. Hence, utilizing the mail piece characteristic set $f_d$ (Sect. 4) for
final candidate identification, mail piece related rejection data $\rho_i$ are derived for
all $N$ mail pieces within a cluster:

    $\rho_i = \begin{cases} \vartheta_i = \min \{ \delta_d(f_{d_i}, f_{d_j}) \mid i, j \in [0, N-1],\ i \neq j \} \\ \sigma_i = \sigma \{ \delta_d(f_{d_i}, f_{d_j}) \mid i, j \in [0, N-1],\ i \neq j \} \end{cases}$    (4)

In the subsequent identification process $\rho_i$ is used to predict the distance relation of
the final mail piece candidate and its nearest neighbour. A similar distance relation
calculated in the identification process leads to the candidate acceptance, while a
different resulting distance relation causes a rejection.
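
The derivation of the rejection data can be sketched as follows. For every registered mail piece in a cluster, the sketch stores the distance to its nearest registered neighbour and the standard deviation of its distances to all other pieces. The function name, the tuple layout and the population standard deviation are assumptions; the `distance` argument stands in for the detailed metric, which is not reproduced here.

```python
from statistics import pstdev


def rejection_data(pieces, distance):
    """Return a (threshold, sigma) pair for each registered piece in a cluster.

    `pieces` holds the detailed characteristic sets of one cluster and
    `distance` is any callable comparing two of them (placeholder for the
    detailed metric). At least two pieces are required.
    """
    if len(pieces) < 2:
        raise ValueError("a cluster needs at least two pieces")
    data = []
    for i, piece in enumerate(pieces):
        dists = [distance(piece, other)
                 for j, other in enumerate(pieces) if j != i]
        # Nearest-neighbour distance and spread of distances for piece i.
        data.append((min(dists), pstdev(dists)))
    return data
```

With scalar stand-in features 0, 1 and 5 and the absolute difference as distance, the three pieces receive the pairs (1, 2.0), (1, 1.5) and (4, 0.5).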



6 Mail Piece Identification 

Using the stored mail piece characteristic sets as well as the search area
characteristics, the identification decision for an unknown mail piece characteristic set
is divided into two steps. At first, the corresponding mail piece cluster has to be
determined. As in the clustering process, the mail piece layout characteristic
set $f_r$ is compared to all cluster prototypes. The target cluster is selected by
a minimum distance classifier based on the underlying Manhattan metric $\delta_m$.
Thus, the mail piece candidates are quickly limited to the mail pieces of one appearing
mail type.
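
The first identification step amounts to a nearest-prototype lookup, sketched below. The function names are hypothetical, and the cluster prototypes are assumed to be the medoid layout vectors.

```python
def manhattan(f1, f2):
    """Manhattan distance between two layout feature vectors."""
    return sum(abs(a - b) for a, b in zip(f1, f2))


def select_cluster(layout, medoids):
    """Return the index of the cluster medoid closest to `layout`."""
    return min(range(len(medoids)),
               key=lambda k: manhattan(layout, medoids[k]))
```

Only one distance per cluster is evaluated here, so the cost of this step grows with the number of mail types rather than with the number of registered mail pieces.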

Within a cluster, the final mail piece candidate is determined by its detailed,
local characteristic set $f_d$. The evaluation of the preferred final mail piece candidate
is mapped onto a minimum distance classification with the underlying distance
metric $\delta_d$. In the proposed approach this classifier is extended with a rejection option.
Assuming that the actual distance relation between the preferred mail piece candidate
and its nearest neighbour behaves like the predicted one, a feasible acceptance
criterion can be defined by

    $\delta_{d_1} < \vartheta_1 < \delta_{d_2}$,    (5)

where $\delta_{d_1}$ represents the distance to the preferred mail piece candidate, $\delta_{d_2}$ the
distance to its nearest neighbour and $\vartheta_1$ the predicted threshold.

Considering image deviations between the registration and the identification
process, the extracted mail piece characteristics vary. In this case the
strict criterion in (5) causes false rejects. In addition to the estimated mail piece
based threshold $\vartheta_1$, the actual feasible distance depends on the appearing mail mix
within a cluster. In particular, private mail pieces differ more than nearly identical
bulk mail pieces. Thus, they permit a higher acceptance criterion than bulk mail with
marginal address deviations only. Assuming that small distance variances indicate nearly
identical bulk mail whereas high variances indicate private mail, the expected mail piece
related distance variance is used to enhance the adaptive acceptance criterion (5).
Based on the mail piece related rejection data set $\rho_1$ derived from the registration
characteristic sets (Sect. 5.2) for the preferred mail piece candidate, a candidate is
considered as corresponding mail piece if and only if:

    $(\delta_{d_1} - c_l \cdot \sigma_1) < \vartheta_1 < (\delta_{d_2} + c_h \cdot \sigma_1) \mid c_l, c_h \in [2 \ldots 10\%]$,    (6)

where the factors $c_l$ and $c_h$ weight the feasible threshold deviation and are
determined empirically.
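
The relaxed acceptance test in (6) can be written as a single comparison, sketched below. The parameter names are hypothetical, and the default weights of 0.05 are merely an assumed value inside the stated 2 to 10% range; setting both weights to zero gives the strict criterion (5).

```python
def accept_candidate(d1, d2, theta, sigma, c_low=0.05, c_high=0.05):
    """Decide whether the preferred candidate is accepted.

    d1    -- distance to the preferred candidate
    d2    -- distance to its nearest neighbour
    theta -- threshold predicted for the candidate during registration
    sigma -- standard deviation of distances predicted for the candidate
    c_low, c_high -- empirical weights (assumed defaults)
    """
    return (d1 - c_low * sigma) < theta < (d2 + c_high * sigma)
```

As an illustration with made-up numbers, d1 = 0.2, d2 = 0.95, theta = 1.0 and sigma = 2.0 fail the strict test because theta exceeds d2, but pass the relaxed test because the upper bound is widened to 1.05.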



7 Experiments 

The quality of the proposed approach is analysed in two respects. On the one 
hand the clustering results are analysed, on the other hand the final mail piece 
identification results are evaluated. The experiments are performed on a real data 
set comprising 5,000 mail pieces. In order to simulate the registration as well as 
the identification process, each mail piece is scanned twice. The data set is parti- 
tioned into typical, size varying subsets of up to 200 mail pieces according to the 
postal application. For result evaluation, ground truth information which references
corresponding mail piece scans is captured manually.

In a first experiment the quality of the proposed clustering is evaluated. Based 
on the first mail piece scan the clustering is performed for each subset. In the iden- 
tification process the mail piece assignment to a cluster is analysed. Utilizing the 
captured ground truth information, the true and false positives are determined for 
each subset and are sorted in ascending order. The achieved results (Table 1) prove 
the quality and the reliability of the proposed algorithm. In respect of the subse- 
quent final identification step, the determination of the corresponding cluster is 
crucially important for the identification of the final candidate and influences the 
identification result. That means, the identification of a wrong cluster reduces the 
identification performance. 

In a second experiment the final performance is evaluated. Mail pieces are iden- 
tified or rejected within the determined cluster. In conformity with the cluster result 
evaluation, the true and false positives complemented by the mail piece rejections 
are determined using the ground truth information (Table 2). The results are promising
and show the applicability of the proposed approach. Remaining problems relate
to the handling of special cases as well as the optimization of the underlying local
features.



Table 1 Cluster identification results of different search areas represented by true and
false positives

                True positives (%)    False positives (%)
1st quartile    100                   0
Average         99.68                 0.32

Table 2 Mail piece identification results for different search areas represented by true and false
positives as well as rejections

                True positives (%)    False positives (%)    Rejections (%)
1st quartile    89.19                 2.70                   8.12
Median          92.22                 1.11                   6.67
3rd quartile    94.12                 0                      5.88
Average         92.02                 2.22                   5.75



8 Conclusion and Outlook 

In this paper a two-staged document identification process using unsupervised learn- 
ing and dynamic rejection criterion estimation has been presented which focuses on 
image based mail piece identification. For postal applications identification accuracy 
as well as error prevention are of particular importance. However, the constantly 
changing and non-deterministic mail piece spectrum as well as the differentia- 
tion of nearly identical bulk mail complicate the definition of a global rejection 
criterion. 

The proposed approach efficiently combines rough layout and detailed local mail 
piece describing features to identify mail pieces within a given search area. Based 
on an unsupervised learning process similar mail pieces are detected and clustered. 
Thus, mail type specific rejection criteria are derived dynamically and used to make 
the final mail piece decision in the identification process. The applicability of the 
proposed approach has been proved in two experiments using real-world data. 

Future work will focus on optimizing local mail piece describing features and 
the handling of special cases in order to improve the identification results.

Part VI

Statistical Musicology 



Statistical Analysis of Human Body Movement 
and Group Interactions in Response to Music 



Frank Desmet, Marc Leman, Micheline Lesaffre, and Leen De Bruyn 



Abstract Quantification of time series that relate to physiological data is chal- 
lenging for empirical music research. Up to now, most studies have focused on 
time-dependent responses of individual subjects in controlled environments. How- 
ever, little is known about time-dependent responses of between-subject interactions 
in an ecological context. This paper provides new findings on the statistical anal- 
ysis of group synchronicity in response to musical stimuli. Different statistical 
techniques were applied to time-dependent data obtained from an experiment on 
embodied listening in individual and group settings. Analyses of inter-group
synchronicity are described. Dynamic Time Warping (DTW) and Cross Correlation
Function (CCF) were found to be valid methods to estimate group coherence of the 
resulting movements. It was found that synchronicity of movements between indi- 
viduals (human-human interactions) increases significantly in the social context. 
Moreover, Analysis of Variance (ANOVA) revealed that the type of music is the 
predominant factor in both the individual and the social context. 

Keywords Embodiment · Human body movement · Music research · Social interaction ·
Statistical analysis.



1 Introduction 



The analysis of human body movement is relevant for a number of research areas 
such as therapy and rehabilitation (Nayak, Wheeler, Shiflett, & Agostinelli, 2000), 
sports (Martin, 2008), bioinformatics (Buldyrev, Goldberger, Havlin, Mantegna, 
Matsa, et al., 1995) and neurology (Machulda, Ward, Borowski, Gunter, Cha, et al.,
2003). Also in empirical music research, there is a growing interest in how the
human body moves and responds to music (Castellano, Bresin, Camurri, & Volpe,
2008; Bernhardt & Robinson, 2008; Thaut, McIntosh, Rice, Miller, Rathbun, et al.,
1996). However, the study of music-driven human body movement is complex
because it has to deal with several factors that introduce variability on top of
music-driven time varying data, such as the neural-muscular-skeletal variability among
driven time varying data, such as the neural-muscular-skeletal variability among 
subjects, the variability in response patterns of single subjects due to learning and 
training, or the subjects’ background (gender, culture) (Stergiou, 2004). The present 
study relies on Leman’s model of music communication, which is based on the 
notion of embodiment (Leman, 2007). The human body is thereby considered as a 
natural mediator between mind and physical environment. In this paper, we focus on 
Leman’s social factor (Leman, Desmet, Styns, Van Noorden, & Moelants, 2007) of 
the above music communication model by studying music-driven body movement 
of a group of people whose social interaction is taking place in ecological condi- 
tions. Music is thereby seen as a social phenomenon and the quantification of social 
interaction is considered to be a key factor for the development of future electronic 
mediation technologies and applications. So far, most studies on music-driven body 
movement have been carried out in controlled laboratory conditions, often with sin- 
gle subjects, limited to simple motor tasks (Toiviainen & Snyder, 2003; Boone & 
Cunningham, 2001). Although some studies have focused on group behavior, few 
studies have studied music-driven body movement in real life (ecological)
environments (Clayton, Sager, & Will, 2004). In this paper we test the hypothesis that
humans move more synchronously to the beat of the music and with each other in
a group setting than in an individual setting. To test this hypothesis we rely on
different methods, yet all methods share a common approach, namely the definition of
similarity measures, which are then applied to the Multivariate Time Series (MTS)
matrices. This paper is organized as follows: in Sect. 2 the experimental design and
data considerations are reviewed. Sect. 3 deals with the analysis; discussion and
conclusions are given in Sects. 4 and 5.



2 Experimental Design and Data Considerations 

The experiment was carried out during the Accenta 2007 exhibition in Ghent, where 
groups of four subjects moved remote Wii sensors while listening to (recorded) 
music. Sixteen groups of four adolescents participated in the experiment (mean age 
sixteen). Audiences could watch the performances of these groups. Each group had 
to perform the task in two conditions, namely, an individual condition, where the 
participants were blindfolded, and a social condition, where the participants could 
see each other. Each group had to move in response to six pieces of music. Each 
piece lasted about 30 s. For a more detailed description of the experimental setup see
De Bruyn, Leman, Moelants, Demey, and Desmet (2008); Demey, Leman, Bossuyt,
and Vanfleteren (2008). The acceleration data from the remote Wii sensors were
registered wirelessly on-line on a laptop computer via the Bluetooth protocol in a PD
patch and sampled at a 100 Hz rate. Given the design of the experiment (16 groups,
four participants per group, six musical excerpts, two conditions, three axes), this
resulted in 2,304 time series with length N = 3,000 (30 s × 100 samples s$^{-1}$). In
order to avoid the influence of hesitations and confusions at the beginning and the
ending of the task, a 5-25 s interval of the time series was chosen for further
analysis, instead of the recorded 0-30 s. The x, y, z dimensions of the accelerometer of
the Wii sensor were further reduced to one single dimension, using the formula



where is the global value at time i , and Ux^ti) the acceleration value for dimension 
X. Inspection of the resulting time series revealed differences in the range of the 
accelerations (strong and weak responses). As the occurrence in time of acceleration 
changes is of interest in this analysis rather than the intensity the amplitude of the 
calculated accelerations was rescaled to an [0,1] interval. Due to the definition of the 
total acceleration the minimum value of the series is 0 hence rescaling was based on 
the division of the values by the maximum value in the corresponding time series. 
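
The preprocessing steps can be summarized in a short sketch. It is an illustration only: the function names are hypothetical, and the reduction to one dimension is assumed here to be the Euclidean norm of the three axes, which agrees with the "norm of the raw data" mentioned in Sect. 3 but is otherwise an assumption. Taking the 5-25 s interval inclusively at 100 Hz gives the 2,001 samples per series used in the analysis.

```python
import math


def global_acceleration(ax, ay, az):
    """Reduce three axis series to one series (assumed: Euclidean norm)."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]


def crop(series, rate_hz=100, start_s=5, end_s=25):
    """Keep the samples between start_s and end_s, both ends included."""
    return series[start_s * rate_hz: end_s * rate_hz + 1]


def rescale_by_max(series):
    """Rescale a non-negative series to [0, 1] by dividing by its maximum."""
    peak = max(series)
    if peak == 0:
        return list(series)  # flat series: nothing to rescale
    return [value / peak for value in series]


if __name__ == "__main__":
    # Hypothetical constant recording of 30 s at 100 Hz, for illustration only.
    ax, ay, az = [3.0] * 3000, [4.0] * 3000, [0.0] * 3000
    series = rescale_by_max(crop(global_acceleration(ax, ay, az)))
    print(len(series), series[0])
```

Dividing by the maximum alone is sufficient here because the norm is non-negative, so no offset has to be subtracted before rescaling.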



3 Analysis 

Before analysis, the time series were tested for stationarity, as it is well known that 
this condition has a great influence on the stability of correlation coefficients (Yang 
& Shahabi, 2005). The Unit Root test was used to investigate possible deviations 
from stationarity. It was found that this assumption could not be accepted in the 
majority of the MTS. Possible explanations for this observation are drift of the Wii 
sensors due to the end of the lifetime of the batteries or failing Bluetooth connec- 
tivity. Therefore, trend removal was used to obtain stationary MTS. Dynamic Time 
Warping (DTW) was then used in order to deal with small anticipations and delays 
in human movement (Parsons, 1987). In this study, we apply a multivariate DTW 
and a similarity measure based on the cost function. Constraints were introduced 
in order to speed up DTW calculations. A Sakoe-Chiba band (Sakoe & Chiba, 
1978) with a width of 100 was selected, which accounts for a 2.5% range. A cumulative 
distance matrix was then used to find the optimal path, by applying dynamic 
programming. DTW was calculated for all six possible combinations in the MTS 
(4 × 2,001) series within each group. Figure 1 shows a fraction of the time series of 
one group (four subjects) before (left panel) and after warping (right panel). 
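
A minimal sketch of band-constrained DTW between two univariate series is given below. It is not the implementation used in the study: the local cost (absolute difference), the reading of the band as the constraint |i - j| <= band, and the function name are assumptions made for illustration.

```python
import math


def dtw_band(x, y, band):
    """DTW between sequences x and y restricted to a Sakoe-Chiba band.

    Only cells with |i - j| <= band are evaluated. Returns (total_cost, path),
    where path lists the aligned index pairs from (0, 0) to the last samples.
    """
    n, m = len(x), len(y)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the end point")
    # cum[i][j]: minimal cumulative cost of aligning x[:i+1] with y[:j+1]
    cum = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - band), min(m, i + band + 1)):
            cost = abs(x[i] - y[j])
            if i == 0 and j == 0:
                cum[i][j] = cost
                continue
            best = math.inf
            if i > 0:
                best = min(best, cum[i - 1][j])
            if j > 0:
                best = min(best, cum[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, cum[i - 1][j - 1])
            cum[i][j] = cost + best
    # Backtrack from the end point along the cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cum[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cum[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cum[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cum[n - 1][m - 1], path


if __name__ == "__main__":
    a = [0, 0, 1, 2, 1, 0]
    b = [0, 1, 2, 1, 0, 0]  # same shape, one sample earlier
    print(dtw_band(a, b, band=2)[0], dtw_band(a, b, band=0)[0])
```

The sketch stores the full cumulative matrix for readability. For series of 2,001 samples, storing only the cells inside the band would reduce memory considerably.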

Fig. 1 Original (left) and corresponding warped (right) accelerations (Group 1, Song 1) 

Table 1 Example of correlation and significance values in individual and social conditions 

        Original (Individual)         Warped (Individual)           Warped (Social)
        Wii1   Wii2   Wii3   Wii4     Wii1   Wii2   Wii3   Wii4     Wii1   Wii2   Wii3   Wii4
Wii1    1      0.012  0.565  0.368    1      0.228  0.711  0.558    1      0.752  0.782  0.644
               0.589  0.000  0.000           0.000  0.000  0.000           0.000  0.000  0.000
Wii2           1      0.030  -0.144          1      0.258  0.224           1      0.767  0.670
                      0.182  0.000                  0.000  0.000                  0.000  0.000
Wii3                  1      0.434                  1      0.592                  1      0.613
                             0.000                         0.000                         0.000
Wii4                         1                             1                             1

The warped MTS were then inspected for normality of the residuals of the differences 
between subjects within the group. It was found that, after warping, normality 
of the residuals could be accepted (Kolmogorov-Smirnov, α = 0.05), which was 
not the case for the original series. The benefit of DTW was investigated by comparing 
the cross correlations of the original and the corresponding warped series. In the 
example it can be seen that in the individual context Wii1, Wii3 and Wii4 move in a 
similar way whilst Wii2 shows a different pattern. Cross correlations on the original 
series show moderate to very low correlations with two non-significant values. The 
DTW data show higher and significant correlations in all inter-subject combinations 
(Table 1). Each cell in this table represents the correlation (upper row) and the 
corresponding significance (bottom row) between the subjects. We define the similarity 
between subject movements as $S_{ij} = (1 - \mathrm{Corr}_{ij})$ with $S_{ij}$ between 0 and 1, 
low values indicating a high (closer related) inter-subject synchronicity. 

To obtain a measure for group coherence, correlations were averaged for each 
group, song and condition. A geometric representation of $S_{ij}$ can be used as a tool 
to classify groups. The plot is constructed by positioning the four participants of a 
group so that the distances (length of the lines) are proportional to the similarities. 
In the example (Fig. 2, left), the subjects handling Wii1, Wii3 and Wii4 move 
synchronously, while the subject handling Wii2 shows a different pattern in the 
individual condition. However, in the social condition (Fig. 2, right) the coherence 
improves (Wii2 moves along with the other participants). 
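
The pairwise similarity and the group coherence can be sketched as follows. The function names are hypothetical, the similarity is taken as one minus the correlation as read from the definition above, and the coherence is the plain mean of the six pairwise correlations of a group of four.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(series_by_subject):
    """Map each unordered pair of subjects to the correlation of their series."""
    return {
        (a, b): pearson(series_by_subject[a], series_by_subject[b])
        for a, b in combinations(sorted(series_by_subject), 2)
    }


def similarities(correlations):
    """Similarity per pair, assumed to be 1 - correlation (low = synchronous)."""
    return {pair: 1.0 - r for pair, r in correlations.items()}


def group_coherence(correlations):
    """Mean pairwise correlation within one group, song and condition."""
    return sum(correlations.values()) / len(correlations)


if __name__ == "__main__":
    # Warped correlations of the example group as listed in Table 1.
    individual = {("Wii1", "Wii2"): 0.228, ("Wii1", "Wii3"): 0.711,
                  ("Wii1", "Wii4"): 0.558, ("Wii2", "Wii3"): 0.258,
                  ("Wii2", "Wii4"): 0.224, ("Wii3", "Wii4"): 0.592}
    social = {("Wii1", "Wii2"): 0.752, ("Wii1", "Wii3"): 0.782,
              ("Wii1", "Wii4"): 0.644, ("Wii2", "Wii3"): 0.767,
              ("Wii2", "Wii4"): 0.670, ("Wii3", "Wii4"): 0.613}
    print(group_coherence(individual), group_coherence(social))
```

With the warped correlations of Table 1, the mean pairwise correlation is about 0.43 in the individual condition and about 0.70 in the social condition, which matches the improvement described for this group.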

A plot of the obtained values (individual vs. social) indicates a possible nonlinear 
trend (Fig. 3). The coherence of a group seems to be proportional to the degree of 
difficulty of the song. For instance, song 4 was well known by the participants and 
had a clear beat. Even in the individual condition, subjects were able to synchronize 
very well with the music. Hence the improvement of the group coherence of the 
social condition was low in this case. On the other hand, song 3 was found difficult 
and unknown by the participants, resulting in a higher effect of the social condition 
on the group coherence. Univariate ANOVA was used to investigate the 
effect of condition and song. Homogeneity of variance (Modified Levene, α = 0.05) 
and normality (Kolmogorov-Smirnov, α = 0.05) could be accepted. 

Fig. 2 Geometric representation of within-group similarities (left: individual condition; right: social condition) 

Fig. 3 Social vs. individual correlations (dashed lines represent 95% interval) 

The test indicates that the type of song was the dominant factor while the effect 
of condition is rather weak but significant. No interaction (condition × song) was 
observed. A Tukey analysis reveals that the songs can be grouped in three subsets 
(S3, S5), (S2) and (S1, S6, S4). Finally, the DTW cost function was evaluated. Several 
cost functions have been proposed. For this study the total cost was based on the 
Euclidean distance of the corresponding warped $(x_i, y_i)$ pairs. Univariate ANOVA 
was used to estimate the effect of song and condition on the DTW cost. It was found 
that the warp cost depends mainly on the song and that there is a small decrease in 
the social condition except for song 4, which has the lowest cost. In order to test 
the validity of the above method a separate experiment was set up. The subjects 
of this experiment were bachelor students of musicology (average age 22) who did 
the same experiment as the Accenta setup in the laboratory at IPEM. The group 
coherences were calculated and compared with the results of the experiment. As 
only three groups were involved the results are only indicative, but nevertheless they 
reveal some interesting information. First of all it can be seen that the social vs. 
individual coherences of the students are comparable with the results of the Accenta 
experiment (Fig. 4). 

Fig. 4 Comparison of Accenta and student correlations 

On the other hand, there are differences over the songs. Song 3 has much higher 
levels for subjects from musicology than for the subjects of the Accenta experiment. 
This can be explained by taking into account that the students all have a musical 
education background and hence are familiar with baroque music. Songs 1, 2, 4 and 6 
all show a high coherence for the students, indicating a possible effect of age. Song 5 
shows an improvement but has low values when compared to the other songs. Song 
5 had the most complex rhythm and was influenced by oriental elements. An effect 
of cultural background may be a possible explanation. 

Alternatively, the human-music interaction was studied based on the number 
of seconds the participants synchronized correctly with the nominal tempo of the 
music. This is calculated from the norm of the raw data for each block of 2 s by 
applying a fast Fourier transform (FFT) over a 4-s moving window with a 2-s overlap. 
The dominant peak in the Fourier transform is identified and compared with the 
nominal beats per minute (BPM) of the excerpt for deciding on the correctness of 
the synchronization. Half and double the nominal BPM were also considered 
correct. For more detailed information about this method see De Bruyn et al. 
(2008) and Demey et al. (2008). Based on the obtained scores, the impact of a social 
context on synchronization was studied using ANOVA. Homogeneity of 
variances (Modified Levene test, α = 0.05) and normality (Kolmogorov-Smirnov, 
α = 0.05) could be accepted. Results show that synchronization results of the 
participants are significantly higher in the social condition compared to the individual 
condition (ANOVA, α = 0.05). The main effects are visualized in an interaction 
plot in Fig. 5. 

Fig. 5 Visualization of the mean synchronization results per song in the individual and social 
condition 
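
The windowed synchronization score can be sketched as follows. This is an illustration under stated assumptions, not the procedure of De Bruyn et al. (2008): a plain discrete Fourier transform replaces an FFT library call (the spectrum is the same), the tolerance of half a frequency bin (7.5 BPM for a 4-s window) and the upper tempo limit are hypothetical parameters, and each matching window is credited with 2 s.

```python
import cmath
import math


def dominant_bpm(window, rate_hz, max_bpm=300.0):
    """Tempo in BPM of the strongest non-DC DFT bin of one window."""
    n = len(window)
    mean = sum(window) / n
    centered = [value - mean for value in window]
    best_bin, best_mag = 1, -1.0
    k = 1
    while k <= n // 2 and k * rate_hz / n * 60.0 <= max_bpm:
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
        k += 1
    return best_bin * rate_hz / n * 60.0


def synchronized_seconds(signal, rate_hz, nominal_bpm,
                         window_s=4, hop_s=2, tol_bpm=7.5):
    """Seconds in which the dominant tempo matches the nominal tempo.

    Half and double the nominal tempo are also accepted, as in the text.
    """
    win, hop = window_s * rate_hz, hop_s * rate_hz
    targets = (nominal_bpm / 2.0, nominal_bpm, nominal_bpm * 2.0)
    seconds = 0
    for start in range(0, len(signal) - win + 1, hop):
        bpm = dominant_bpm(signal[start:start + win], rate_hz)
        if any(abs(bpm - target) <= tol_bpm for target in targets):
            seconds += hop_s
    return seconds


if __name__ == "__main__":
    # Synthetic 2 Hz movement (120 BPM), 20 s at 100 Hz, for illustration only.
    movement = [math.sin(2 * math.pi * 2.0 * t / 100) for t in range(2000)]
    print(synchronized_seconds(movement, 100, nominal_bpm=120))
```

A 4-s window limits the frequency resolution to 0.25 Hz, that is 15 BPM, so some tolerance around the nominal tempo is unavoidable with this window length.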



As can be seen in Fig. 5, the songs themselves have a great impact on the syn- 
chronization results. A multiple comparison Tukey analysis shows that participants 
score significantly lower for songs 3 and 5 than for songs 1, 4 and 6, while the 
results of song 2 are somewhere in between. This can be explained by the rhythmi- 
cal complexity of the songs: songs 1, 4 and 6 are pop songs with a very clear beat, 
songs 3 and 5 can be interpreted either binary or ternary, whereas song 2 can only 
be interpreted binary but has an unclear beat. 



4 Discussion 

For the MTS data in the experiment presented here, in which subjects were 
asked to synchronize with the beat of the music, identical results are obtained 
with the analysis of human-human synchronization based on DTW and the anal- 
ysis of human-music synchronization based on FFT (Fast Fourier Transform) 
(De Bruyn et al., 2008; Demey et al., 2008). This indicates the validity of the DTW- 
based method to analyze movements to music of multiple subjects, which can now 
be applied to the study of human-human interaction in more generic movements 
to music. Although the data were collected in an ecological setting with several 
unknown sources of variance, it was shown that the effect of song and condition 
can be quantified. The impact of the characteristics of the music is the predominant 
factor and this is in agreement with the model of musical communication 
on which this study is based. Humans can indeed decode the intentionality of the 
music and translate the energy input into movements as a function of musical con- 
tent. Both in the individual and social context this can be quantified. In the social 
condition there is a benefit as a consequence of imitation effects during the social 
interaction. The results clearly show that group coherence improves when people 
move together. Whether this is only the result of direct human-human interaction 
or whether other factors, such as the presence of an audience, play a role is not yet clear. The 
results of the group coherence measure are in good accordance with the analysis 
of the other variables derived from the collected data. Up to now it is not possible 
to separate the human-human interaction from the human-environment one in an 
ecological setting. In order to improve this type of experiment a new experimental 
design was proposed. In this design the subjects are not blindfolded in the indi- 
vidual condition, but separated using screens, and 10 songs with carefully selected 
properties are chosen. A drawback of the Accenta data was the lack of consistent 
information on the participants’ background. The pre-survey could not be used for 
analysis due to unbalanced results, and information about the experience of the 
subjects during the experiment (post-survey) was not available. In future experiments 
properly designed surveys need to be included. Large-scale user studies have 
proven to be of great importance for the analysis of human-music relationships 
(Lesaffre, De Voogdt, Leman, Baets, Meyer, et al., 2008). Finally, additional 
measurements such as video analysis and more sensors per subject are recommended 
in order to refine the analysis. On the analytical level, improvement of the applied 
techniques and the use of new methods will be investigated. At this moment, 
correlation optimized DTW based on PCA (Tomasi, Berg, & Andersson, 2004) and 
quantification of complexity and determinism of the movement data (Sarkar & Barat, 
2006) are being tested. Another important issue is the reduction of the calculation 
time of DTW. Several methods will be tested in the near future (Dixon, 2005). 



5 Conclusion 

In this study different statistical techniques were tested for the analysis of human 
movements to music. Using DTW in combination with CCF and ANOVA it was 
found that the type of music is the dominant factor of the inter group movements 
as a response to music stimuli. The effect of condition is low but significant, even 
in the ecological setting of the experiment. Using DTW and CCF it is possible to 
quantify interactions and to classify groups by coherence. The outcome of this study 
also makes it possible to define a statistical path as a tool for researchers and as a 
guideline for appropriate experimental designs for future research. 



Applying Statistical Models and Parametric Distance Measures for Music Similarity Search 



Hanna Lukashevich, Christian Dittmar, and Christoph Bastuck 



Abstract Automatically deriving similarity relations between music pieces is an 
inherent field of music information retrieval research. Due to the nearly unrestricted 
amount of musical data, real-world similarity search algorithms have 
to be highly efficient and scalable. A possible solution is to represent each music 
excerpt with a statistical model (e.g., a Gaussian mixture model) and thus to reduce 
the computational costs by applying parametric distance measures between the 
models. In this paper we discuss combinations of different parametric modelling 
techniques and distance measures and weigh the benefits of each one 
against the others. 

Keywords Gaussian mixture models · Kullback-Leibler divergence · Music information 
retrieval · Music similarity 



1 Introduction 

During recent years the scientific and commercial interest in automatic methods 
for revealing similarity relations between music pieces has tremendously increased. 
Stimulated by the ever-growing availability and size of digital music collections, 
music similarity has been identified as an increasingly important means to aid convenient 
exploration of large music catalogues. Evidently, commercial entities like 
online music shops and content aggregators have realized that so-called recommendation 
engines can significantly improve their unique selling point and foster customer 
loyalty. Providing the users with search functionality beyond conventional metadata 
like artist, title and album is a very effective tool to enable casual music consumers 
to broaden their musical horizon and discover new products. With the help of 
recommendation systems, the average listener is neither forced to keep track of the 
newest releases via music magazines, nor does he need to pre-listen to thousands of 
songs in a row. Thus, automatic recommendation seems to pose a possible solution 
for the so-called long-tail phenomenon that has been recently raised in Anderson 
(2006). For well-known mainstream music, a large volume of user-generated browsing 
traces, reviews, play-lists and recommendations available in different online communities 
can be analyzed through collaborative filtering methods (Cohen & Fan, 2000) in 
order to reveal relations between artists, songs and genres. For novel or niche content 
one obvious solution to derive such data is content-based similarity search. 
Since the early days of Music Information Retrieval (MIR) the search for items 
related to a specific query song or a set of those (Query by Example) has been 
a consistent focus of scientific interest. Thus, a multitude of different approaches 
with varying degree of complexity has been proposed (Tzanetakis, Essl, & Cook, 
2001; Herre, Allamanche, & Ertel, 2003; Pampalk, 2006). Many publications have 
addressed suitable modelling methods that represent the musical gist whilst keep- 
ing the description blurry enough to account for small but irrelevant differences 
(Aucouturier, Pachet, & Sandler, 2005). With regard to real-world applicability it 
becomes clear that the human perception of music similarity as a subjective, 
context-dependent, and multi-dimensional concept cannot be modelled to the utmost extent, 
especially under large-scale conditions (more than 1,000 music items) (Aucouturier 
& Pachet, 2004). 

The usual practice of MIR algorithms is to use a compact representation of an 
audio signal derived in short-time signal snippets (frames). These representations 
(usually called “feature vectors” or “features”) are designed to correlate to some 
semantically meaningful properties of the musical signal. Defining a similarity measure 
between two audio signals which consist of multiple feature vector frames still 
remains a challenging task. As was shown in previous studies (Berenzweig, 
Logan, Ellis, & Whitman, 2003; Logan & Salomon, 2001), one possible 
solution is first to model the distribution of the feature vectors using a parametric 
statistical model, for example a Gaussian Mixture Model (GMM), and then to 
define a distance measure between the distributions via a distance measure between 
the models. In this paper we introduce novel types of statistical models based on 
song segmentation information. In particular, we show that even using the information 
contained in the most prominent segment of the song (e.g., “chorus”) gives 
reasonable results for music similarity search. We present the evaluation results 
of investigating the applicability of various distance measures depending on the 
applied statistical models. 



2 Feature Extraction 

Nearly all state-of-the-art music similarity techniques use acoustic features 
calculated in short frames. Each feature is designed to correlate with one of the 
aspects of perceptual similarity, e.g., timbre, tempo, loudness, harmony. Although 
distinct audio signals may possess audio properties which can be captured only 
by signal-specific feature vectors, the MIR community has developed a set of 
state-of-the-art feature vectors that perform well for music similarity search. 

Table 1 Low-level audio features used in the baseline system 

Feature                              Short name  Dimension
Log loudness                         LogLoud     12
Norm loudness                        NormLoud    12
Mel-frequency cepstral coefficients  MFCC        16
Audio spectrum envelope              ASE         14
Spectral centroid                    CENT        12
Spectral crest factor                SCF         16
Spectral flatness measure            SFM         16
Zero crossing rate                   ZCR         1

In this paper we use a set of eight low-level features derived from 10 ms 
frames of the audio signal. This set includes a well-established timbre descriptor, 
Mel-frequency cepstral coefficients (MFCCs) (Bogert, Healy, & Tukey, 1963), and 
several descriptors capturing rhythmic, loudness or frequency relations information, 
proposed within the MPEG-7 standard (Kim, Moreau, & Sikora, 2005). The list of 
the used features and their dimensionality is shown in Table 1. 

The probability density function of the feature vector for each of the music sig- 
nals is later parameterized within a statistical model. To keep these models as simple 
as possible we represent a distribution for each of the features with an individual 
model. Here we assume the distinct dimensions within each of the features to be sta- 
tistically independent. Modelling the features independently also gives a possibility 
to combine the features with different time resolution, as there is no need to merge 
various aspect descriptors from one time-frame in one vector. The results obtained 
for each of the features independently are later merged together via an aggregation 
process. The aggregation method is described below in Sect. 5.2. 



3 Statistical Models 

Defining a similarity measure between two audio signals which consist of multiple 
feature vector frames still remains a challenging task. As was shown in 
previous studies, one possible solution is first to model the distribution 
of the feature vectors using a parametric statistical model, for example a 
Gaussian Mixture Model (GMM), and then to define a distance measure between 
the distributions via a distance measure between the models. This approach has 
several advantages: it enables a very compact and informative representation of 
an audio signal and it allows defining the similarity of two signals based on the 
parameters of the models. A good overview of applying various statistical models 
(e.g., GMMs or k-means) for music similarity search is given in Aucouturier and 
Pachet (2004). Depending on the number of Gaussians, such models can be tuned 
to the necessary level of generalization. Thus the research (Aucouturier & Pachet, 2004) 
mainly concentrates on the optimization of the number of mixtures. 

The similarity metric between music pieces is reflected via a distance measure 
between the models. The majority of distance measures between GMMs are 
constructed based on the distances between single mixtures of the models. For such 
metrics the distance between the mixtures inside one model (for one music track) 
becomes of high interest. In this publication we compare the performance of the 
model with the maximized distance between the mixtures (k-means) to the classical 
GMM with the same number of mixtures (k-means initialization followed by the 
EM algorithm; Dempster, Laird, & Rubin, 1977). We also propose to use a GMM 
where the optimal number of Gaussians is chosen using the Bayesian Information 
Criterion (BIC). In addition we propose to use the semantic information about song 
segmentation. Song segmentation implies a time-domain segmentation and clustering 
of the musical piece into possibly repeatable, semantically meaningful segments. 
For example, a western pop song can typically be segmented into “intro”, “verse”, 
“chorus”, “bridge”, and “outro” parts. For similar songs not all segments might be 
similar. For human perception, songs with a similar “chorus” are similar. We 
apply a song segmentation algorithm based on BIC, which has been successfully applied 
to the speaker segmentation task (Moschou, Kotti, Benetos, & Kotropoulos, 2007). 
Then we introduce four statistical models based on this segmentation information. 
As such we model each segment state (e.g., all repeated “chorus” segments form 
one segment state) with one Gaussian, and then weight these Gaussians in a mixture 
depending on the durations of the segment states. Thus frequently repeated and 
long segments get higher weights. The segmentation algorithm also provides us with 
information about the “importance” of every segment. As a rule, the “chorus” is 
judged to be the most “important” segment of the song, and the “verse” is named as 
the second most important segment. We investigate the possibility of modelling the 
song using only the most prominent (or the two most prominent) segments. 
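
The segmentation-based model with one Gaussian per segment state can be sketched as follows. The function name and the data layout are hypothetical; diagonal covariances are used in line with the independence assumption of Sect. 2, and the weight of a state is taken as its share of the frames, which stands in for its total duration.

```python
from collections import defaultdict


def segment_state_model(frames, labels):
    """One diagonal Gaussian per segment state, weighted by total duration.

    frames: list of feature vectors (lists of floats), one per time frame
    labels: segment-state label of each frame, e.g. "verse" or "chorus"
    Returns {label: (weight, mean, variance)}; the weights sum to one.
    """
    groups = defaultdict(list)
    for frame, label in zip(frames, labels):
        groups[label].append(frame)
    total = len(labels)
    model = {}
    for label, members in groups.items():
        n, dim = len(members), len(members[0])
        mean = [sum(m[k] for m in members) / n for k in range(dim)]
        var = [sum((m[k] - mean[k]) ** 2 for m in members) / n
               for k in range(dim)]
        model[label] = (n / total, mean, var)
    return model


if __name__ == "__main__":
    # Hypothetical one-dimensional feature with two segment states.
    frames = [[0.0], [2.0], [10.0], [10.0], [12.0], [12.0]]
    labels = ["verse", "verse", "chorus", "chorus", "chorus", "chorus"]
    for label, (weight, mean, var) in segment_state_model(frames, labels).items():
        print(label, weight, mean, var)
```

In practice a small variance floor would be needed, since a state whose frames are identical in one dimension yields a zero variance that the distance measures cannot handle.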

An overview of all applied models is given in Table 2. All models are formally 
written in the form of GMMs. 

Table 2 Overview of the statistical models 

Short name   Short description                                Data          No. of mixtures  EM algorithm
one          1 Gaussian per song                              Whole song    1                No
gmm_bic      GMM, number of mixtures is determined by BIC     Whole song    Adaptive         Yes
3mix_kmeans  3 mixtures, only k-means initialization          Whole song    3                No
3mix_em      GMM with 3 mixtures, k-means + EM algorithm      Whole song    3                Yes
segm         1 Gaussian per segment state                     Whole song    Adaptive         No
segm_em      GMM, initialization with “segm” + EM algorithm   Whole song    Adaptive         Yes
1segm        1 Gaussian for the most prominent segment        One segment   1                No
2segm        Mixture of 2 Gaussians for 2 prominent segments  Two segments  2                No



4 Parametric Distance Measures 

In this paper we concentrate only on those distance measure techniques that do not 
use any time- and computation-consuming re-sampling (like Monte Carlo or 
likelihood ratio tests). All proposed models can be written in the form of GMMs, 
even if the models themselves are not GMMs in a strict statistical sense (like the 
3mix_kmeans or segm models). We need to define the distance between the models 

$$f(x) = \sum_a \pi_a N(x, \mu_a, \Sigma_a) = \sum_a \pi_a f_a(x) \quad \text{and} \quad g(x) = \sum_b \omega_b N(x, \mu_b, \Sigma_b) = \sum_b \omega_b g_b(x)$$

with weighting factors $\pi_a$ and $\omega_b$, means $\mu_a$ and $\mu_b$, and covariance 
matrices $\Sigma_a$ and $\Sigma_b$, respectively. A good example of such a parametric 
distance measure is the Kullback-Leibler divergence (KL-divergence) (Kullback, 1968), 
which for two single Gaussians is 

$$D(f \| g) = \frac{1}{2} \left( \log \frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}\left(\Sigma_g^{-1} \Sigma_f\right) + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) - d \right),$$

where $f$ and $g$ are single Gaussians with means $\mu_f$ and $\mu_g$ and covariance 
matrices $\Sigma_f$ and $\Sigma_g$, respectively, and $d$ is the dimensionality of the feature 
space. The KL-divergence is not symmetric and needs to be symmetrized: 
$D_2(f_a \| g_b) = \frac{1}{2} [D(f_a \| g_b) + D(g_b \| f_a)]$. Unfortunately, the KL-divergence for 
two GMMs is not analytically tractable. In a recent paper Hershey and Olsen (2007) 
presented several approximations of the KL-divergence between two GMMs with 
very promising results. In our evaluation we include two of the approximations 
proposed in Hershey and Olsen (2007), namely “variational” and “variational upper bound”. 
Helen and Virtanen (2007) proposed a Euclidean distance between two GMMs that 
does not use the KL-divergence at all. One more well-established technique for 
measuring the distance between GMMs is the Earth Mover’s distance (EMD) (Rubner, Tomasi, 
& Guibas, 1998). It was also successfully used in music similarity search (Logan 
& Salomon, 2001). 
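
The closed-form KL-divergence between two single Gaussians and its symmetrized version can be sketched as follows. The sketch assumes diagonal covariance matrices, in line with the independence assumption of Sect. 2, so that determinants, traces and inverses reduce to sums over the dimensions. The function names are hypothetical.

```python
import math


def kl_gauss_diag(mu_f, var_f, mu_g, var_g):
    """KL-divergence D(f || g) of two Gaussians with diagonal covariances.

    mu_*: lists of means; var_*: lists of (strictly positive) variances.
    """
    d = len(mu_f)
    log_det_ratio = sum(math.log(vg / vf) for vf, vg in zip(var_f, var_g))
    trace = sum(vf / vg for vf, vg in zip(var_f, var_g))
    mahalanobis = sum((mf - mg) ** 2 / vg
                      for mf, mg, vg in zip(mu_f, mu_g, var_g))
    return 0.5 * (log_det_ratio + trace + mahalanobis - d)


def kl2_gauss_diag(mu_f, var_f, mu_g, var_g):
    """Symmetrized divergence D2 = (D(f || g) + D(g || f)) / 2."""
    return 0.5 * (kl_gauss_diag(mu_f, var_f, mu_g, var_g)
                  + kl_gauss_diag(mu_g, var_g, mu_f, var_f))


if __name__ == "__main__":
    # Two hypothetical one-dimensional Gaussians with different variances.
    print(kl_gauss_diag([0.0], [1.0], [0.0], [4.0]))
    print(kl_gauss_diag([0.0], [4.0], [0.0], [1.0]))
    print(kl2_gauss_diag([0.0], [1.0], [0.0], [4.0]))
```

The two directed values in the example differ, which illustrates why the divergence has to be symmetrized before it is used as a distance between tracks.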

In this paper we additionally include in the evaluation two other approximations 
of the KL-divergence for the GMM case. The first one is the so-called Minimal 
Distance, finding the closest Gaussian for each of the mixtures and then weighting 
and symmetrizing the results: 

$$D_{kl2\_min}(f \| g) = \frac{1}{2} \left[ \sum_a \pi_a \min_b D_2(f_a \| g_b) + \sum_b \omega_b \min_a D_2(f_a \| g_b) \right]. \qquad (1)$$

Table 3 Overview of the parametric distance measures 

Short name  Description
kl2_min     Minimal distance, given by (1)
ed          Euclidean distance, defined in Helen and Virtanen (2007)
kl2_ed      Hybrid KL-Euclidean distance, defined in (2)
kl_var      Variational approximation, defined in Hershey and Olsen (2007)
kl_upper    Variational upper bound, defined in Hershey and Olsen (2007)
kl2_emd     Earth Mover’s distance, based on symmetrized KL-divergence
kl_emd      Earth Mover’s distance, based on non-symmetrized KL-divergence



Additionally we introduce the Hybrid KL-Euclidean approximation: 

$$D_{kl2\_ed}(f \| g) = \sum_a \sum_b \pi_a \omega_b D_2(f_a \| g_b) - \frac{1}{2} \sum_a \sum_{a'} \pi_a \pi_{a'} D_2(f_a \| f_{a'}) - \frac{1}{2} \sum_b \sum_{b'} \omega_b \omega_{b'} D_2(g_b \| g_{b'}). \qquad (2)$$

Although it is not a distance metric in the strict mathematical sense, it shows good 
performance for the music similarity task. An overview of all used parametric distances 
is given in Table 3. 
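
The two mixture-level measures can be sketched as follows. The sketch rests on assumptions: (1) is read as the weighted, symmetrized minimum of the pairwise symmetrized divergences, (2) as the cross term minus one half of each within-model term, and the component Gaussians have diagonal covariances. All names are hypothetical; a mixture is a list of (weight, means, variances) tuples.

```python
import math


def kl(mu_f, var_f, mu_g, var_g):
    """Closed-form KL-divergence of two diagonal-covariance Gaussians."""
    d = len(mu_f)
    log_det = sum(math.log(vg / vf) for vf, vg in zip(var_f, var_g))
    trace = sum(vf / vg for vf, vg in zip(var_f, var_g))
    maha = sum((mf - mg) ** 2 / vg for mf, mg, vg in zip(mu_f, mu_g, var_g))
    return 0.5 * (log_det + trace + maha - d)


def kl2(p, q):
    """Symmetrized KL-divergence between two mixture components."""
    _, mu_p, var_p = p
    _, mu_q, var_q = q
    return 0.5 * (kl(mu_p, var_p, mu_q, var_q) + kl(mu_q, var_q, mu_p, var_p))


def kl2_min(f, g):
    """Minimal distance: closest component in the other model, weighted."""
    f_side = sum(a[0] * min(kl2(a, b) for b in g) for a in f)
    g_side = sum(b[0] * min(kl2(a, b) for a in f) for b in g)
    return 0.5 * (f_side + g_side)


def kl2_ed(f, g):
    """Hybrid KL-Euclidean measure: cross term minus half the self terms."""
    cross = sum(a[0] * b[0] * kl2(a, b) for a in f for b in g)
    within_f = sum(a[0] * a2[0] * kl2(a, a2) for a in f for a2 in f)
    within_g = sum(b[0] * b2[0] * kl2(b, b2) for b in g for b2 in g)
    return cross - 0.5 * within_f - 0.5 * within_g


if __name__ == "__main__":
    # Hypothetical one-dimensional models: two components versus one.
    f = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
    g = [(1.0, [0.0], [1.0])]
    print(kl2_min(f, g), kl2_ed(f, g))
```

For two single-Gaussian models both functions reduce to the symmetrized KL-divergence between the two Gaussians, since the minimum is taken over one candidate and the within-model terms vanish.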



5 Evaluation 

5.1 Test Data and Evaluation Metric 

For quantitative evaluation of the proposed models and distance measures within 
the described music similarity system, a test-set of full-length music pieces has been 
assembled. Altogether, the test-set consists of 775 tracks, belonging to 10 musical 
genres subdivided into 60 sub-genres (see Dittmar, Bastuck, & Gruhne, 2007 for 
details). For evaluation the following methodology is applied. For each song in the 
test set, the five most similar recommendations (excluding the self-match), further 
denoted as Top five, are computed using the proposed system and features. Instead of 
strictly evaluating the same sub-genres, a relation matrix taking the correlation between 
sub-genres into account has been used. The values in this matrix lie in the 
interval [0, 1], where 1 indicates a strong degree of inter-genre similarity. The 
average similarity is computed by simply taking the mean of the relation matrix 
entries assigned to the Top five. 
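
The evaluation metric can be sketched as follows. The function names and the data layout (a full distance matrix, one sub-genre index per track, and a relation matrix indexed by sub-genre) are assumptions made for illustration.

```python
def top_k(distances, query, k=5):
    """Indices of the k tracks closest to the query, excluding the query."""
    others = [j for j in range(len(distances)) if j != query]
    others.sort(key=lambda j: distances[query][j])
    return others[:k]


def average_similarity(distances, subgenres, relation, k=5):
    """Mean relation-matrix entry over the top-k lists of all tracks.

    distances: square matrix of model distances between tracks
    subgenres: sub-genre index of each track
    relation:  matrix of inter-genre similarities with entries in [0, 1]
    """
    total, count = 0.0, 0
    for query in range(len(distances)):
        for hit in top_k(distances, query, k):
            total += relation[subgenres[query]][subgenres[hit]]
            count += 1
    return total / count


if __name__ == "__main__":
    # Hypothetical toy collection: four tracks from two sub-genres.
    distances = [[0.0, 1.0, 5.0, 1.5],
                 [1.0, 0.0, 4.0, 7.0],
                 [5.0, 4.0, 0.0, 2.0],
                 [1.5, 7.0, 2.0, 0.0]]
    subgenres = [0, 0, 1, 1]
    relation = [[1.0, 0.5],
                [0.5, 1.0]]
    print(average_similarity(distances, subgenres, relation, k=1))
```
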

5.2 Aggregation Process 

The aggregation process is necessary for merging the individual similarity search 
results achievable with each of the described features in isolation. Therefore, the 
following modification of the Borda criterion (Dwork, Kumar, Naor, & Sivakumar, 
2001) is used to combine the similarity result lists. For each similarity search result 
list obtained with a single feature and the respective model and distance, the number 
of entries taken into consideration is limited to a certain rank. Two measures are 
assigned to every remaining item: the normalized mean inverse rank X and the 
normalized occurrence statistic Y. Finally, r = αX + (1 − α)Y performs a weighting of 
either the rank or the occurrence according to the parameter α. Both measures 
express to a certain extent the consensus between the features and the corresponding 
result lists. Thus, the aggregation scheme will emphasize those items that have been 
consistently favoured by the single feature spaces. 
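
A possible reading of this aggregation scheme is sketched below. The text does not specify the normalizations, so the following choices are assumptions: X is the mean inverse rank over the lists that contain the item, divided by the largest such mean, and Y is the fraction of lists that contain the item. The function name, the default rank limit and the default α are hypothetical, and at least one non-empty result list is expected.

```python
def aggregate(result_lists, max_rank=10, alpha=0.5):
    """Combine per-feature result lists into one ranking.

    result_lists: one ranked list of items per feature (best match first)
    Returns the items ordered by r = alpha * X + (1 - alpha) * Y.
    """
    inverse_ranks = {}
    for results in result_lists:
        for rank, item in enumerate(results[:max_rank], start=1):
            inverse_ranks.setdefault(item, []).append(1.0 / rank)
    n_lists = len(result_lists)
    mean_inverse = {item: sum(v) / len(v) for item, v in inverse_ranks.items()}
    largest = max(mean_inverse.values())
    scores = {}
    for item, values in inverse_ranks.items():
        x = mean_inverse[item] / largest  # normalized mean inverse rank
        y = len(values) / n_lists         # normalized occurrence statistic
        scores[item] = alpha * x + (1.0 - alpha) * y
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


if __name__ == "__main__":
    # Hypothetical result lists from three features.
    lists = [["a", "b", "c"], ["a", "c", "d"], ["b", "a", "e"]]
    print(aggregate(lists, max_rank=3))
```

Items that appear in many lists and near the top of each are ranked first, which mirrors the consensus argument made above.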



6 Results 

As already mentioned in Sect. 2, in order to reduce the complexity of the 
models each of the features is modeled with an independent model. This scenario 
leads to a 7 × 8 × 7 results space, corresponding to seven features, eight statistical 
models and seven parametric distance measures. Due to lack of room we concentrate 
only on the most remarkable trends. The best performing models and distance 
measures for each feature are depicted in Fig. 1. The corresponding models and 
distances for the three best results are given in Table 4. As one can see, the best 
performance of 0.41 is reached using the ASE feature with the classical three-mixture GMM 
and the kl2_min distance measure. All in all the kl2_min distance appears frequently 
in the Top three places in Table 4, indicating its stable overall performance. The other 
remarkable observation one can make from Table 4 is the fact that in 12 of 21 cases 
the Top three models use the song segmentation information. For example, for 
the loudness related features (LogLoud and NormLoud) all Top three models are 
based on segmentation. 

Fig. 1 Three best results for every feature (legend: 1st place, 2nd place, 3rd place, Random). The 
similarity is computed by taking the mean of the relation matrix entries assigned to the Top five. 
The corresponding combinations of the statistical model and parametric distance measure are 
given in Table 4 

Table 4 Best performing combinations of statistical models and parametric distance measures for 
each of the features. For the performance results see Fig. 1 

          1st place             2nd place             3rd place
Feature   Model     Distance    Model     Distance    Model     Distance
ASE       3mix_em   kl2_min     segm_em   kl2_min     segm      kl_var
LogLoud   segm      kl_var      segm      kl2_min     2segm     kl2_min
NormLoud  segm_em   kl2_min     segm      kl2_min     segm      kl2_emd
MFCC      3mix_em   kl2_emd     3mix_em   kl2_min     one       kl2_min
SCF       gmm_bic   kl_var      gmm_bic   kl2_min     one       kl_var
CENT      one       kl2_min     2segm     kl2_min     2segm     kl2_emd
SFM       segm      ed          segm      kl2_emd     gmm_bic   kl_emd

Fig. 2 Results for all statistical models and parametric distance measures after the aggregation 
process (legend: kl2_min, ed, kl2_ed, kl_var, kl_upper, kl2_emd, kl_emd, Random). The similarity 
is computed as a mean of the relation matrix entries assigned to the Top five 

To combine the results derived for each feature we perform an aggregation as
described in Sect. 5.2. Surprisingly, after the aggregation the best performance of
0.45 is achieved by the one-Gaussian model (see Fig. 2). In the case of one-Gaussian
models all best performing distances (kl2_min, kl2_ed, and kl2_emd) simplify
to the classical KL-divergence between two single Gaussians. Although the
one-Gaussian model does not often perform best for individually taken features,
the model is stable and shows good consensus during the aggregation process.
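
The closed-form KL-divergence between two single Gaussians mentioned above can be sketched as follows. This is only an illustration: it assumes diagonal covariance matrices, and the function names are hypothetical. Whether and how the divergence is symmetrised in the evaluated distance measures is not stated in this section, so the symmetric variant is shown only as one common choice.

```python
import math


def kl_gauss_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariances (assumption)."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Per-dimension term of the closed-form Gaussian KL-divergence.
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def kl_symmetric(mu_p, var_p, mu_q, var_q):
    """Symmetrised divergence KL(p || q) + KL(q || p), one common choice."""
    return (kl_gauss_diag(mu_p, var_p, mu_q, var_q)
            + kl_gauss_diag(mu_q, var_q, mu_p, var_p))
```

For identical Gaussians the divergence is zero, and it grows with the squared distance between the means scaled by the variance of the second model.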

Overall, the results after the aggregation (Fig. 2) point out the best performing
distance measures, namely kl2_min, kl2_ed, and kl2_emd. Comparing the
performance for the pairs of models 3mix_kmeans and 3mix_em, and segm and
segm_em, one can see a slight decrease of the performance for the majority of
the models. This could indicate that the applied approximations of the
KL-divergence work more accurately when the distance between the mixtures within
one model is maximized.



7 Conclusions 

In this paper we presented an evaluation of various statistical models and
parametric distance measures for music similarity search. Based on seven low-level
features, we tested and compared eight different models and seven distance metrics.
In addition to the state-of-the-art modeling with GMMs and k-means, we proposed
models based on the segmentation information. We compared the performance
of the commonly recommended approximations of the KL-divergence between GMMs and
introduced our self-defined kl2_ed distance measure. The evaluation was performed
on a test set of 775 tracks.

The paper presented the most characteristic results of the evaluation. We
found that there was no combination of statistical model and distance measure
which outperformed the others for all applied feature vectors. The comparative
study showed that among the utilized features the best results were reached using
ASE, LogLoud, NormLoud, and MFCC. High performance was achieved
when modeling the distribution of each feature with just one Gaussian.

The evaluation showed that the song segmentation information could be
successfully used for music similarity search. Results achieved while using the one or
two most prominent segments were only slightly worse than those derived using
the whole track. The consensus ranking aggregation process significantly improved
the results. The investigation of various approximations of the KL-divergence between
two GMMs pointed out that the most reliable of the applied distance measures were
kl2_min, kl2_ed, and kl2_emd.
Finding Music Fads by Clustering Online Radio
Data with Emergent Self Organizing Maps



Florian Meyer and Alfred Ultsch 



Abstract Music charts provide a simple statistic of sold records. Web 2.0 provides 
social networks, where detailed information from listeners is available. In particular, 
there are keywords, so called tags, that are given by the network members to classify 
songs into genres. 

An important topic is music fads, i.e., short time intervals of a few weeks with
a strong presence of similar music genres. We introduce a distance on the weekly
music charts to uncover music fads. Fads are visualized using Emergent Self
Organizing Maps (ESOM). They are automatically found by analysing the progress of
the impact of music genres. This algorithm does not rely on an estimation of the
number of fads. Dominant genres of the fads were found to characterize them.

Keywords Clustering · Music fads · Self organizing maps.



1 Introduction 

Finding the right time for placing a song in the market is very important. Music
genres (like colours for clothes, shoe brands, ...) come in and out of fashion. Such
fashions often last only for a few weeks. To exploit these brief fashions, so called
fads, we developed an easily usable method to visualize and analyse the behaviour of
fashions by observing online radio data. A great benefit of using online data is that
they are both free and up to date. It is an easy and cheap way to follow the fads.

Tagging is often referred to as the process of assigning keywords to a special
group of objects and is an important feature of community based social networks
like Flickr, YouTube or Last.fm. We used the user-generated descriptions of Last.fm
to generate features that describe songs. Tagging is already used by many users
to classify items and is controlled by the creators and consumers of the content. For
our study we chose to analyse the data provided by the music community Last.fm,
an internet radio broadcaster featuring a music recommendation system. The users
can assign tags to songs. Tags make it possible to organize the songs in a
semantic way and form a useful basis for discovering new music trends. Because
of the huge number of songs offered by the online radio station, it is necessary to
reduce the online data. One possible way is to concentrate on the most important
(meaning most often heard) songs, but this way much information is lost. Another
problem with using song charts is that hit songs usually stay at the top of the
charts for only a few weeks, but the departure of a song does not mean that there is
a new music style. A better way is to transform the song charts into genre vectors.
We use tags to assign genres to the songs and look at which genre becomes popular
instead of analysing songs directly. Another advantage of examining genres instead
of songs is the possibility to adapt the model. If one is interested only in certain
genres, he or she can easily select them.

An intuitive user interface is required to avoid losing an overview. We propose the 
Emergent-Self-Organizing-Map (ESOM) (Ultsch, 2003) to visualize the genre vec- 
tors. It is topology preserving and combined with the U-Map it provides a visually 
appealing user interface and an intuitive way of exploring new content. 



2 Related Works 

There has been some work on enhancing the user interface based on tags and we
will briefly mention some here. Flickr uses “Flickr clusters” which can provide
related tags to a popular tag, grouped into clusters. Begelman, Keller, and Smadja
(2006) used clustering algorithms to find strongly related tags, visualizing them as
a graph. Hassan-Montero and Herrero-Solana (2006) proposed a method for an
improved tag cloud and a technique to display these tags with a clustering based
layout.

The ESOM has already been used successfully to visualize collections of music
and photos and to cluster documents. Most of these works have in common
that they cluster the data based on features extracted directly from the media. An
example is MusicMiner (Mörchen, Ultsch, Nöcker, & Stamm, 2005) which uses
the timbre distance, a measure based on frequency analysis of audio data. The
WEBSOM project (Kaski, Honkela, Lagus, & Kohonen, 1998) is an ESOM based
approach in free text mining. Here each document is encoded as a histogram of
word categories which are formed by the ESOM algorithm based on the similarities
in the contexts of the words. Our approach is different: we do not use information
that can be extracted from the raw data of the objects themselves but user-generated
content instead. The works mentioned above show that the ESOM is a powerful tool
for visualizing high dimensional data. In Lehwark, Risi, and Ultsch (2007) music
was clustered by using tag information and it was shown how an ESOM can help to
navigate through the music in an intuitive way.






3 Data 

The data used in this article are taken from the online radio Last.fm. We
used data from 110 weeks, starting in 2005 and ending in 2007. Altogether, our
statistics are based on more than 6 million songs and more than 75,000 tags. Tags are
short symbolic descriptions, e.g., “heavy metal”, “favorite song”, etc. The users
of Last.fm assign tags to songs and browse the content via tags, allowing them to
listen only to songs tagged in a certain way. The tag count $t_{ij}$ is the number
of users who assigned tag $i$ to song $j$. For each week, we have the number of
times each song was played by users. These numbers are recorded for every week
$i \in W_{all} = \{1, \dots, 110\}$ in a 6 million dimensional vector called the song
vector, denoted $s_i$. An even larger matrix (6 million x 75,000) is required to store
how often a tag is assigned to a song.



4 Frequential Genre Integration 

There are many more tags than genres that can be considered, so we have to remove
the ones that do not stand for a certain kind of music genre, such as “seen-live”,
“favourite albums”, etc.

The tf-idf (term frequency - inverse document frequency) technique is often used
on documents to find words in the text which are able to characterise the document.
These words should not appear too often (like the articles the, a, ...) but should also
not be very rare in the text. The tf-idf algorithm gives every word a weight which
shows how well it can be used to characterise the text.

Last.fm provides the number of people ($t_{ij}$, the tag count) that have used a
specific tag $i$ for a song $j$. The $t_{ij}$ were scaled to the range [0, 1]. Then the
term frequency $f_{ij}$ can be calculated as

$$f_{ij} = \frac{t_{ij}}{\sum_{k} t_{kj}}$$

with the denominator being the accumulated frequencies of the tags used for
a specific song. The term frequency (in this case a tag frequency) indicates how
specific a tag is for a certain song. The inverse document frequency $f_i$ is then
defined as follows:

$$f_i = \log \frac{N}{n_i}$$

with $N$ being the total number of songs in the collection and $n_i$ the number of songs
that have been assigned the tag $i$. The inverse document frequency indicates how
specific a tag is. Tags like “favorite song” are assigned to nearly every song, so they
are not very useful in describing the character of the songs.

The weights

$$w_{ij} = f_{ij} f_i$$

were used to detect the important tags.
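
As a minimal sketch of this weighting, the following function computes tf-idf weights from a small dense table of tag counts. It assumes that the term frequency is the tag count divided by the sum of all tag counts of the same song and that a tag counts as assigned to a song when its count is positive. The function name and the dense list-of-lists layout are illustrative only; data of the size described here would need a sparse representation.

```python
import math


def tfidf_weights(tag_counts):
    """tag_counts[i][j] is the scaled count t_ij of tag i on song j.

    Returns a table of weights w_ij = f_ij * f_i with the same shape.
    """
    n_tags = len(tag_counts)
    n_songs = len(tag_counts[0])
    # Accumulated tag counts per song (denominator of the term frequency).
    song_totals = [sum(tag_counts[i][j] for i in range(n_tags))
                   for j in range(n_songs)]
    weights = []
    for i in range(n_tags):
        # n_i: number of songs that carry tag i at all.
        n_i = sum(1 for j in range(n_songs) if tag_counts[i][j] > 0)
        idf = math.log(n_songs / n_i) if n_i else 0.0
        row = []
        for j in range(n_songs):
            tf = tag_counts[i][j] / song_totals[j] if song_totals[j] else 0.0
            row.append(tf * idf)
        weights.append(row)
    return weights
```

A tag that is attached to every song receives an inverse document frequency of log(1) = 0 and therefore a weight of zero everywhere, which is the behaviour described for tags such as “favorite song”.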

How much impact the genres have in certain weeks can now be calculated by
applying the weights matrix $W = \{w_{ij}\}$ to the weekly song vector $s_i$:

$$g_i = W s_i$$

In this way, we obtain 110 weekly genre vectors $g_i$ and strongly reduce the
dimension. The song vectors $s_i$ have a dimension of over 6 million; in the genre
vectors the dimension is cut down to 600. Every component of the vector $g_i$ describes
the impact of a genre in week $i$. The genre vectors are not only smaller than the
song vectors, they are also more appropriate for analysing music fads because the
influence of solitary songs is marginal.
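
The reduction from a song vector to a genre vector amounts to a matrix-vector product. The sketch below assumes the weights are stored as one row per genre tag and one column per song; the function name is hypothetical and the dense representation is for illustration only.

```python
def genre_vector(weights, song_plays):
    """Weekly genre vector: one weighted sum of play counts per genre tag.

    weights[i][j] is the weight of genre tag i for song j,
    song_plays[j] is how often song j was played in the week.
    """
    return [sum(w * s for w, s in zip(row, song_plays)) for row in weights]
```

With 600 retained genre tags, each week is thus described by 600 numbers regardless of how many songs were played.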



5 Visualisation of Music Fads 

An ESOM is an artificial neural network that performs a nonlinear and discontin- 
uous projection which is able to preserve topographic structures such as clusters. 
The genre vectors are mapped onto a two-dimensional grid of neurons. The grid is 
toroid to avoid boundary effects. In contrast to the k-means SOM, the ESOM has
significantly more neurons than there are expected clusters.

The unsupervised training process is partly motivated by how visual information
is handled in the cerebral cortex of the mammalian brain and equals a regression
of an ordered set of model vectors $m_i \in \mathbb{R}^n$ into the space of observation
vectors $x \in \mathbb{R}^n$ by performing the following process:

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t) \, (x(t) - m_i(t)),$$

where $t$ is the sample index of the regression step, whereby the regression is
performed recursively for each presentation of a sample of $x$. Index $c$, the best
matching unit (BMU) or winner, is defined by the condition

$$\|x(t) - m_c(t)\| \le \|x(t) - m_i(t)\| \quad \forall i.$$

The neighbourhood function $h$ is the Gaussian

$$h_{c(x),i}(t) = \alpha(t) \exp\left(-\frac{\|r_i - r_c\|^2}{2\sigma^2(t)}\right),$$

where $0 < \alpha(t) < 1$ is the learning-rate factor, which decreases monotonically with
the regression steps, $r_i$ and $r_c$ are the vectorial locations in the display grid and
$\sigma(t)$ corresponds to the width of the neighbourhood function, which is also
decreasing monotonically with the regression steps. For a more detailed discussion
of the SOM see Kaski et al. (1998).

Fig. 1 In the left picture the movement of the genre vectors on the ESOM is shown. The right
picture shows the corresponding U-Matrix
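
One regression step of such a map on a toroid grid can be sketched as follows. This is a simplified illustration and not the Databionics ESOM Tools implementation: the dictionary layout, the function names and the fixed values passed for the learning rate and the neighbourhood width are assumptions, and the decay schedules of both parameters are left to the caller.

```python
import math


def toroid_dist2(a, b, rows, cols):
    """Squared grid distance between positions a and b on a toroid grid."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    # On a torus the shorter way around each axis counts.
    dr = min(dr, rows - dr)
    dc = min(dc, cols - dc)
    return dr * dr + dc * dc


def som_step(models, x, rows, cols, alpha, sigma):
    """Present sample x once; update all model vectors in place.

    models maps a grid position (row, col) to its model vector (a list).
    Returns the grid position of the best matching unit.
    """
    def dist2(m):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

    bmu = min(models, key=lambda pos: dist2(models[pos]))
    for pos, m in models.items():
        # Gaussian neighbourhood around the BMU, scaled by the learning rate.
        h = alpha * math.exp(-toroid_dist2(pos, bmu, rows, cols)
                             / (2.0 * sigma ** 2))
        models[pos] = [mi + h * (xi - mi) for xi, mi in zip(x, m)]
    return bmu
```

Training repeats this step over all samples for several epochs while the learning rate and the neighbourhood width shrink.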

The U-Map (Ultsch, 2003) is constructed on top of the map of the ESOM. The
U-Height of each neuron $n_i$ equals the accumulated distances of $n_i$ to its immediate
neighbours $N(i)$:

$$uh(i) = \sum_{j \in N(i)} d(m_i, m_j),$$

where $d(x, y)$ is the distance function used in the SOM algorithm to construct the
map and $N(i)$ denotes the indices of the immediate neighbours of neuron $i$.
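
A small sketch of the U-Height computation on a toroid grid is given below. It assumes the Euclidean distance and a neighbourhood of the four direct grid neighbours; both choices, as well as the function name and data layout, are assumptions made for illustration.

```python
import math


def u_heights(models, rows, cols):
    """U-Height per neuron: summed distance to its four toroid neighbours.

    models maps a grid position (row, col) to its model vector (a list).
    """
    heights = {}
    for (r, c), m in models.items():
        neighbours = [((r - 1) % rows, c), ((r + 1) % rows, c),
                      (r, (c - 1) % cols), (r, (c + 1) % cols)]
        heights[(r, c)] = sum(math.dist(m, models[n]) for n in neighbours)
    return heights
```

Neurons whose model vectors lie far from those of their neighbours obtain large heights and form the hills of the landscape, while neurons inside a cluster obtain small heights and form valleys.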

A single U-Height shows the local distance structure of the corresponding neu- 
ron. The overall structure of distance emerges, if a global view of a U-Map is 
regarded. A U-Map is usually displayed as a three-dimensional landscape and has 
become a standard tool to display the distance structures of the ESOM. The U-Map 
delivers a “landscape” of the distance relationships of the input data in the data 
space. It has the property that weight vectors of neurons with large U-Heights are 
very distant from other vectors in the data space and that weight vectors of neurons 
with small U-Heights are surrounded by other vectors in the data space. Outliers and 
other possible cluster structures can easily be recognized. U-Maps have been used 
in a number of applications to detect new and meaningful information in data sets. 

Figure 1 shows a toroid ESOM with 82 x 50 neurons that was trained with the
110 weekly genre vectors using the Databionics ESOM Tools. The left picture
shows a line drawn from one BMU to the next. The right picture shows the resulting
U-Map.

To get a plain island map with a unique representation for every neuron we have 
to cut the toroid map along the highest hills. 

6 Identification of Fads 




In the last section a U-Map with a unique representation for the genre charts was
created. On this map, it can be seen that the genre vectors follow a great valley, but
from time to time they “jump” over a small hill.
Fig. 2 The left picture shows the Euclidean distances of the genre vectors from one week to the
next. On the right the logarithmic distribution of the distances is shown

Obviously, the genre vectors are not randomly distributed. Chronologically
neighbouring vectors are represented nearby on the U-Map. This shows the strong
connection between time and the genre vectors. On the other hand there are
some relatively high hills between some genre vectors and their successors. So there
is a steady, slow movement for some weeks but then the genre vector shifts strongly
from one week to the next. The time intervals with slow movement of the genre
vectors are called music fads.

Finding the music fads means finding those gaps in the data. In Fig. 2 the Euclidean
distances of the genre vectors between one week and the following week are plotted.
Some extreme values show that the genre vectors do not move continuously. They
show when a “jump” happens and when a new fad begins.

There are 110 weeks and so there are 109 distances between them. The dis- 
tribution of the log transformed distances is also shown in Fig. 2 (on the right 
side). 

This distribution can be approximated by a mixture model of three normal
distributions. The two largest stand for normal weekly distances, the third contains the
“jumps”. To find a boundary value that decides how large a weekly distance must be
for a new fad to have started, the two large Gaussians were separated from the small
one by a Bayes decision. The boundary found this way is about 1.15. According to
this we have six music fads.
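
Once a boundary value is available, the segmentation into fads is a single pass over the weekly distances. The sketch below assumes that the boundary is expressed on the same scale as the Euclidean distances that are compared against it; if the boundary refers to log transformed distances, the comparison has to be made on that scale instead. The function name is hypothetical.

```python
import math


def split_into_fads(genre_vectors, boundary):
    """Group consecutive week indices into fads.

    A new fad starts whenever the distance between the genre vectors of
    two consecutive weeks exceeds the boundary.
    """
    fads = [[0]]
    for week in range(1, len(genre_vectors)):
        jump = math.dist(genre_vectors[week - 1], genre_vectors[week])
        if jump > boundary:
            fads.append([])
        fads[-1].append(week)
    return fads
```

Because fads are delimited by jumps only, the number of fads does not have to be specified in advance.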



7 Fads Characterisation 



To make the music fads into useful information it is necessary to characterise the
fads. For this purpose we calculate the fad genre vector $F_k$ as the average over the
weekly genre vectors in a fad

$$F_k = \frac{1}{|W_k|} \sum_{i \in W_k} g_i,$$

where $W_k$ are the weeks of fad number $k$ and $g_i$ is the genre vector of
week $i$. We compare this cluster genre chart to the average of the genre vectors of
the whole time period

$$D_k = F_k - \frac{1}{|W_{all}|} \sum_{i \in W_{all}} g_i,$$

where $W_{all}$ is the set of all 110 weeks.

Dk is a vector which contains the displacement of the impact a genre has during 
a fad from its average impact. This gives us a metric ranking of the importance of 
the genres. 
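
The displacement vector and the resulting genre ranking can be sketched as follows, with hypothetical function names and plain Python lists for the genre vectors.

```python
def fad_displacement(genre_vectors, fad_weeks):
    """Mean genre vector of the fad minus the mean over all weeks."""
    dim = len(genre_vectors[0])

    def mean(indices):
        return [sum(genre_vectors[i][d] for i in indices) / len(indices)
                for d in range(dim)]

    fad_mean = mean(fad_weeks)
    overall = mean(range(len(genre_vectors)))
    return [f - o for f, o in zip(fad_mean, overall)]


def top_genres(displacement, names, n=3):
    """Names of the n genres with the largest positive displacement."""
    order = sorted(range(len(displacement)),
                   key=lambda d: displacement[d], reverse=True)
    return [names[d] for d in order[:n]]
```

Genres with a large positive component were more present during the fad than on average and are candidates for characterising it.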



8 Results 

As a result of this analysis 6 music fads have been found. The music fads vary
strongly in their length. While the shortest lasted only 2 weeks, the longest fad
lasted for 50 weeks. The changes in the music style are easily found and the fads
can be characterised by the dominant music genres. For each music fad there is a
displacement vector which shows the impact of a genre for this fad. Those genres
which have the most influence on the fad have the highest values. The most
important genres assigned to their music fads are shown in Fig. 3.

9 Discussion 

It could be objected that online radio users are not a representative sample of the 
population. That is certainly true. Nevertheless, it is an interesting community for music
research. A music manager usually is not interested in the whole market but only in 
his field of activity. It is easy to adapt this method for personal interests by selecting 
a special set of genres. 

This method does not make predictions of music fads, but it is possible to recog- 
nise a new fad very quickly. Of course it would be better to identify music fads 
before they started, but that is hardly possible. 

The tf-idf technique shows the impact of a genre. It does not state how many 
genres are optimal for the analysis. In this paper we took 600 genres which pro- 
duced good results. Of course, the selection of genres also depends on the aim of 
the analysis. Though the tf-idf technique is a useful tool, the genre selection requires 
external knowledge and experience. 

10 Summary 

Visualization of music fads using user-generated tags was demonstrated to work
well. We use the Emergent-Self-Organizing-Map (ESOM) to visualize the genre
vectors. It is topology preserving and combined with topographical maps it
provides a visually appealing user interface and an intuitive way to understand
fashions in the complex music market. The temporal fashion development is shown
as a path on the U-Map in valleys surrounded by mountains. In these valleys there
exist hills that separate the music fads. There is an easy way to find the
characteristics of a fad by comparing the average genre vector with the fad genre
vector. This way, we get the genres that have the highest impact on the fad. This
information is important for every music manager who decides when to place a song
on the market.

Fig. 3 The characteristic genres for the music fads are (1) 90s, Stoner rock, industrial; (2) hip-hop,
britpop, rap; (3) German, industrial metal, industrial; (4) metal, indie rock, indie; (5) progressive
metal, progressive rock, metal; (6) funk, progressive metal, funk rock
Analysis of Polyphonic Musical Time Series



Katrin Sommer and Claus Weihs 



Abstract A general model for pitch tracking of polyphonic musical time series
will be introduced. Based on a model of Davy and Godsill (Bayesian harmonic
models for musical pitch estimation and analysis, Technical Report 431, Cambridge
University Engineering Department, 2002) the different
pitches of the musical sound are estimated with MCMC methods simultaneously.
Additionally a preprocessing step is designed to improve the estimation of the
fundamental frequencies (A comparative study on polyphonic musical time series using
MCMC methods. In C. Preisach et al., editors, Data Analysis, Machine Learning,
and Applications, Springer, Berlin, 2008). The preprocessing step compares real
audio data with an alphabet that is constructed from the McGill Master Samples
(Opolko and Wapnick, McGill University Master Samples [Compact disc], McGill
University, Montreal, 1987) and consists of tones of different instruments. The tones
with minimal Itakura-Saito distortion (Gray et al., Transactions on Acoustics, Speech,
and Signal Processing ASSP-28(4):367-376, 1980) are chosen as first estimates
and as starting points for the MCMC algorithms. Furthermore the implementation
of the alphabet is an approach for the recognition of the instruments generating the
musical time series. Results are presented for mixed monophonic data from McGill
and for self-recorded polyphonic audio data.

Keywords Alphabet · MCMC · Musical time series · Polyphony.



1 Introduction 



In this paper a model for polyphonic sound will be introduced. There are two aims
for the analysis of polyphonic sound. One aim is the automatic transcription of
musical time series, the other aim is instrument recognition and instrument tracking.
For this we implemented a model for pitch tracking based on the model of Davy
and Godsill (2002). The results of this model will be improved by preprocessing.
By means of the preprocessing there are chances to recognize the instrument and
therefore to track instruments and notes simultaneously.

The outline of the paper is as follows. First the model for polyphonic sound will 
be introduced. Then the preprocessing step will be considered. We introduce the 
design of an alphabet of notes and discuss some distortion measures which are used 
to compare the real audio data with the alphabet. 

After that we apply the polyphonic model and the preprocessing step to real audio 
data from the McGill University Master Samples (Opolko & Wapnick, 1987) and to 
self-recorded audio data. Finally the results are discussed and an outlook to future 
work is given. 



2 Model for Polyphonic Sound 

In this section the harmonic polyphonic model will be introduced and its compo- 
nents will be illustrated. The model is based on the model of Davy and Godsill 
(2002) and has the following structure: 

$$y_t = \sum_{k=1}^{K} \sum_{h=1}^{H_k} \sum_{i=0}^{I} \phi_i(t) \left\{ a_{k,h,i} \cos(2\pi h f_k / f_s \, t) + b_{k,h,i} \sin(2\pi h f_k / f_s \, t) \right\} + \epsilon_t.$$

The number of observations of the audio signal is $T$, $t \in \{0, \dots, T-1\}$.
Each signal is normalized to [-1, 1] since the absolute overall loudness of
different recordings is not relevant. The signal $y_t$ is made up of $K$ tones, each
composed of $H_k$ partial tones (harmonics). In this paper the number of tones $K$ is
assumed to be known. The first partial of the $k$-th tone is the fundamental frequency
$f_k$, the other $H_k - 1$ partials are called overtones. Further, $f_s$ is the sampling rate.

To reduce the number of parameters to be estimated, the amplitudes $a_{k,h,t}$ and
$b_{k,h,t}$ of the $k$-th tone and the $h$-th partial tone at each timepoint $t$ are modelled
with $I + 1$ basis functions. The basis functions $\phi_i(t)$ are equally spaced Hanning
windows with 50% overlap:

$$\phi_i(t) := \cos^2\left[\pi (t - i\Delta)/(2\Delta)\right] \, 1_{[(i-1)\Delta,\,(i+1)\Delta]}(t),$$

with $\Delta = (T - 1)/I$. The use of basis functions allows the exploration of musical
data without constant amplitudes. Changes in loudness can be modelled without
getting an overflow of parameters. For more details on the use of basis functions see
Sommer and Weihs (2006).

Further, the $a_{k,h,i}$ and $b_{k,h,i}$ are the amplitudes of the $k$-th tone, the $h$-th partial
tone and the $i$-th basis function. Finally, $\epsilon_t$ is the model error.
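
The basis functions can be evaluated with a few lines of code. The sketch below uses hypothetical function names and checks a property that follows from the squared-cosine shape and the 50% overlap: at every time point the I + 1 windows add up to one, so a constant amplitude is represented by equal coefficients.

```python
import math


def basis(i, t, delta):
    """Squared-cosine window centred at i*delta with half-width delta."""
    if (i - 1) * delta <= t <= (i + 1) * delta:
        return math.cos(math.pi * (t - i * delta) / (2.0 * delta)) ** 2
    return 0.0


def basis_sum(t, n_obs, n_intervals):
    """Sum of all n_intervals + 1 basis functions at time point t."""
    delta = (n_obs - 1) / n_intervals
    return sum(basis(i, t, delta) for i in range(n_intervals + 1))
```

Here `n_obs` plays the role of T and `n_intervals` the role of I in the text.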
The model can be written as a hierarchical Bayes model. The estimation of the
parameters results from the stochastic search of the best coefficients in a given
region. The region and the probabilities are specified by distributions. This leads to
the implementation of MCMC methods (Gilks, Richardson, & Spiegelhalter, 1996).

Instead of a full generation of the distributions as in standard MCMC methods,
which is computationally very expensive, we use an optimal model fit. For this we
implemented a stopping criterion which checks every 50 iterations whether a linear
regression of the latest 50 iterations against the iteration number is significant. The
iterations are stopped when a dedicated significance level is exceeded. For this
purpose a maximum number of 500 iterations is assumed. For details of the MCMC
computation and the use of the stopping criterion to get the optimal fit of our model
see Sommer and Weihs (2007).
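
A stopping rule of this kind can be sketched as follows. Several details are assumptions made for the illustration and are not taken from the paper: the quantity that is monitored (any scalar trace of the chain, for example the model error), the two-sided 5% significance level, and the hard-coded critical value of Student's t distribution with 48 degrees of freedom. The rule stops as soon as the slope over the latest window is no longer significantly different from zero, or when the maximum number of iterations is reached.

```python
import math

# Approximate two-sided 5% critical value of Student's t with 48 degrees
# of freedom (window of 50 points, two estimated parameters). Assumed level.
T_CRIT_48 = 2.011


def slope_t_statistic(values):
    """t statistic of the slope when regressing values on 0, 1, ..., n-1."""
    n = len(values)
    xs = range(n)
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, values))
    if rss == 0.0:
        # Perfect fit: either exactly flat or an exact trend.
        return 0.0 if slope == 0.0 else math.copysign(math.inf, slope)
    return slope / math.sqrt(rss / (n - 2) / sxx)


def should_stop(trace, window=50, max_iter=500):
    """Decide after each iteration whether the chain can be stopped."""
    if len(trace) >= max_iter:
        return True
    if len(trace) < window or len(trace) % window != 0:
        return False
    return abs(slope_t_statistic(trace[-window:])) < T_CRIT_48
```

The function would be called after every iteration with the trace collected so far; between multiples of the window length it always returns False.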



3 Preprocessing 

The aim of a preprocessing step is the improvement of the results of the MCMC 
estimation of the polyphonic time series data. We assume that better starting values 
which lie near the true values of the correct notes lead to better results. For this we 
construct a so-called alphabet which includes all possible notes of all possible instru- 
ments. The actual tone will be compared with all entries in the alphabet. The two 
notes with minimal distance are the starting values for MCMC. First we describe 
the construction of the alphabet then we discuss different distortion measures which 
can be used to compare the real audio data with the data of the alphabet. 



3.1 Alphabet 

The alphabet is composed of one entry for each note of each instrument chosen to 
be compared with the audio data. 

For one entry in the alphabet the mean periodogram (or spectral density) of seven
time intervals of 512 datapoints each with 50% overlap is evaluated. The first
1,000 datapoints of the note are cut off to get the stable and constant part of a note
as reference.

The audio data will be analyzed by comparing the periodogram of the audio data 
with the summed periodograms of each two-note-combination in the alphabet. In 
order to allow for the possibility that only one instrument is playing in a piece of 
music, one entry for “silence” is added. 

The two notes with minimal distortion are starting values for the MCMC compu- 
tation. We used several distortion measures to check which fits best for the musical 
data. 
3.2 Distortion Measures



Distortion measures have been established for the analysis of speech. A traditional
distortion measure for speech is the $L_p$-norm of the difference of the log spectra:

$$d_{L_p}(f, g) = \|\ln f - \ln g\|_p = \|\ln(f/g)\|_p = \left[ \frac{1}{2\pi} \sum_{j=1}^{N/2} \left| \ln \frac{f(\omega_j)}{g(\omega_j)} \right|^p \right]^{1/p},$$

where $f$ and $g$ are estimated spectral densities. These two spectral densities are the
periodogram of the alphabet and the periodogram of the real note. The periodogram
is computed at the Fourier frequencies $\omega_j = j/T$.

Common choices for $p$ are 1 (the absolute mean deviation), 2 (the root mean square
deviation) and $\infty$ (the maximum deviation).

Besides the computation of the $L_p$-norm of the empirical spectral densities one
can compare the tones with the $L_p$-norm of the empirical spectral distributions:

$$d_{cL_p}(F, G) = \left[ \sum_{j=N/2+1} \left| F(\omega_j) - G(\omega_j) \right|^p \right]^{1/p},$$

where $F$ and $G$ are the estimated integrated spectra.

Finally we take into account a distortion measure which is used for the analysis of
the spectral shape of speech, the Itakura-Saito distortion:

$$d_{IS}(f, g) = \frac{1}{2\pi} \sum_{j=1}^{N/2} \left[ \frac{f(\omega_j)}{g(\omega_j)} - \ln \frac{f(\omega_j)}{g(\omega_j)} - 1 \right].$$

The Itakura-Saito distance is a measure which weights bigger values in the
spectrum more than zeros (Gray, Buzo, Gray, & Matsuyama, 1980).
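
The log-spectral distance and the Itakura-Saito distortion between two periodograms can be sketched as below. The function names are hypothetical, both spectra are assumed to be strictly positive and sampled at the same frequencies, and any constant normalising factor is deliberately left out because it does not change which alphabet entry has minimal distortion.

```python
import math


def log_spectral_distance(f, g, p):
    """L_p norm of ln(f/g) over the frequency bins; p may be math.inf."""
    ratios = [abs(math.log(fj / gj)) for fj, gj in zip(f, g)]
    if math.isinf(p):
        return max(ratios)
    return sum(r ** p for r in ratios) ** (1.0 / p)


def itakura_saito(f, g):
    """Sum of f/g - ln(f/g) - 1 over the frequency bins."""
    return sum(fj / gj - math.log(fj / gj) - 1.0 for fj, gj in zip(f, g))
```

Both measures are zero for identical spectra. Unlike the log-spectral distance, the Itakura-Saito distortion is not symmetric in its arguments, so it matters which spectrum is taken as the reference.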



4 Results 

In this section results of estimating the fundamental frequencies of real audio data 
will be discussed. First, the data used in our studies will be introduced. Further the 
construction of an alphabet will be reconsidered and then the results based on this 
alphabet are depicted. Finally additional results are shown. 



4.1 Data 

The data used for our polyphonic studies are real audio data from the McGill
University Master Samples (Opolko & Wapnick, 1987) and self-recorded data. From
the McGill University Master Samples we chose five instruments (electric guitar,
piano, violin, flute and trumpet), each with five notes (c4: 262 Hz, e4: 330 Hz, g4:
390 Hz, a4: 440 Hz and c5: 523 Hz), out of two groups of instruments, string
instruments and wind instruments. The three string instruments are played in different
ways, namely picked, struck and bowed. The two wind instruments are a woodwind
instrument and a brass instrument. So different types of instruments are considered
in our analyses.

Each of the five instruments was combined with the piano at 262 Hz as lower
tone. There are five instrument-combinations: piano-electric guitar, piano-piano,
piano-flute, piano-violin and piano-trumpet. Every instrument-combination was
analyzed for every note-combination. Thus there were 25 combinations from
the McGill data. For an overview see Table 1.

The self-recorded audio-data were recorded with flute, piano and violin. For the 
ability to construct an alphabet out of the self-recorded data we sampled from each 
instrument monophonic notes. Further we recorded two-note combinations from 
seven instrument combinations, flute-violin, flute-piano, violin-flute, violin-piano, 
piano-flute, piano-violin and piano-piano. Every instrument combination plays five 
note-combinations: c4-c4, c4-e4, c4-g4, c4-a4, c4-c5. Partly the notes are played 
one octave higher. This fact will be regarded by counting the number of correctly 
estimated note-combinations. The first instrument in the instrument-combinations 
plays the c4, the second instrument plays the other notes. That is, there are 35 note- 
combinations to be analyzed (see Table 1). 

For each note-combination, estimates were computed for ten time intervals.
The estimate of the fundamental frequencies is obtained by voting over the results of
the ten time intervals. The notes which occurred most often are the estimates.
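
The voting step can be sketched with a counter over the per-interval estimates. The sketch assumes that the vote is taken over whole note pairs and that the order of the two notes within a pair is irrelevant; voting on each note separately would be an equally plausible reading. The function name is hypothetical.

```python
from collections import Counter


def vote(interval_estimates):
    """Return the note pair that was estimated most often."""
    counts = Counter(tuple(sorted(pair)) for pair in interval_estimates)
    return counts.most_common(1)[0][0]
```

Each element of `interval_estimates` is the pair of notes estimated for one time interval.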

In addition a clerical canon called Halleluja was recorded. 

4.2 Construction of the Alphabet 

Table 1 Overview of note- and instrument-combinations of the McGill data and the
self-recorded audio data used for the analyses. Each note-combination is joined with each
instrument-combination, leading to 25 combinations for the McGill data and 35 combinations for
the self-recorded audio data

Note-combinations   Instrument-combinations of
                    McGill data             Self-recorded audio data
c4-c4               Piano-electric guitar   Piano-flute
c4-e4               Piano-piano             Piano-piano
c4-g4               Piano-violin            Piano-violin
c4-a4               Piano-flute             Violin-flute
c4-c5               Piano-trumpet           Violin-piano
                                            Flute-piano
                                            Flute-violin

For our alphabet we used the audio data from the McGill University Master Samples
and the monophonic notes of the self-recorded audio data. From the McGill
University Master Samples eight instruments were chosen (flute, bassoon, cello, electric
guitar, piano, trumpet, tuba and violin) with all available notes of these instruments.
These are 348 notes. Additionally, all 86 monophonic notes from the self-recorded
audio data were added to the alphabet. With the entry for “silence” there are 435
entries in the alphabet.

So for each time interval to be analyzed there are 94,830 comparisons which 
means a high computational burden when there are more instruments and notes in 
the alphabet. 
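
The figure of 94,830 comparisons is consistent with counting all unordered pairs of alphabet entries in which an entry may also be paired with itself. This reading is an assumption, and the function name is hypothetical.

```python
from math import comb


def two_note_comparisons(alphabet_size):
    """Unordered pairs of alphabet entries, an entry may pair with itself."""
    return comb(alphabet_size, 2) + alphabet_size
```

For 435 entries this gives 94,395 pairs of distinct entries plus 435 pairs of identical entries. The count grows quadratically with the size of the alphabet, which explains the computational burden for larger alphabets.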



4.3 First Results 

The first analyses of polyphonic data were made with superimposed oscillations
of two tones from the McGill University Master Samples. The first tone was a c4
(262 Hz) played by the piano. This tone was combined with each instrument-tone
combination we used, giving 25 datasets, each normalized to [-1, 1]. The pitches
of the tones were tracked over ten time intervals of T = 512 observations with
50% overlap at a sampling rate of 11,025 Hz. The number of observations in one
time interval is a trade-off between the computational burden and the quality of the
estimate. In 15 of the 25 datasets both notes were estimated correctly with
MCMC. When multiples of the correct notes are counted as correct estimations, the
resulting notes were correct in 21 cases.
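
To make the framing concrete, the sketch below computes the start indices of ten windows of 512 samples with 50% overlap and the resulting signal length. The helper name `frame_starts` is hypothetical; only T = 512, the overlap, the number of intervals and the sampling rate are taken from the text.

```python
def frame_starts(n_frames, frame_len, overlap=0.5):
    """Start index of each analysis window for a given fractional overlap."""
    hop = int(frame_len * (1 - overlap))  # 256 samples for T = 512 and 50% overlap
    return [k * hop for k in range(n_frames)]


SAMPLING_RATE = 11025  # Hz
FRAME_LEN = 512        # observations per time interval

starts = frame_starts(10, FRAME_LEN)
total_samples = starts[-1] + FRAME_LEN         # samples covered by all ten windows
total_seconds = total_samples / SAMPLING_RATE  # about a quarter of a second
frame_seconds = FRAME_LEN / SAMPLING_RATE      # about 46 ms per window
```

Under these settings the ten intervals together cover only about a quarter of a second of audio, which illustrates why a longer window would raise the computational burden quickly.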

With the preprocessing step, an alphabet of 150 notes of the McGill University 
Master Samples from the five instruments flute, electric guitar, piano, trumpet and 
violin, all note-combinations were estimated correctly when compared with the dis- 
tortion measure dii- This is obvious because the note-combinations and the audio 
data derive from the same audio data (Sommer & Weihs, 2007). 



4.4 Comparison of Distortion Measures 

In this section we present the results of the comparison of the distortion measures
on the basis of the 35 note-combinations of the self-recorded real audio data. The
results are shown in Table 2. The left column gives the number of correctly estimated
note-combinations. In all cases multiples of the correct tones are allowed.
Few note-combinations are estimated correctly when the distortion measures based on
the cumulated spectral function are used; d_cL∞ estimates all note-combinations
incorrectly, and in only seven combinations is one tone estimated correctly. All other
distortion measures estimate one note correctly in more than 30 cases. The
Itakura-Saito distance d_IS is the best distortion measure for finding both notes when
multiples of the notes are allowed.
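
For orientation, the Itakura-Saito distance between two power spectra can be sketched as below. The discrete form used here is the common textbook definition and may differ in scaling from the one used in the paper; the function names, the toy spectra and the small constant `eps` are assumptions for the example.

```python
import math


def itakura_saito(p, q, eps=1e-12):
    """Itakura-Saito distance between two power spectra of equal length.

    Textbook form: sum over frequency bins of p/q - log(p/q) - 1.
    eps guards against division by zero (an implementation assumption).
    """
    total = 0.0
    for p_k, q_k in zip(p, q):
        ratio = (p_k + eps) / (q_k + eps)
        total += ratio - math.log(ratio) - 1.0
    return total


def closest_entry(spectrum, alphabet):
    """Name of the alphabet entry whose spectrum has the smallest distance."""
    return min(alphabet, key=lambda name: itakura_saito(spectrum, alphabet[name]))


# Toy spectra with three bins each (illustrative values only).
alphabet = {"a": [1.0, 4.0, 1.0], "b": [4.0, 1.0, 1.0]}
best = closest_entry([1.1, 3.8, 0.9], alphabet)
```

The distance is not symmetric, so the order of the observed and the reference spectrum matters.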

A closer look at the results with the Itakura-Saito distortion shows that 15 of the 24
note-combinations are estimated correctly without including multiples. Adding the
MCMC computation results in 19 correctly estimated note-combinations. Counting
multiples as correctly estimated leads to 31 correct note-combinations. Apparently,
the combination of the preprocessing and the MCMC computation increases the number
of correctly estimated note-combinations.

Table 2 Comparison of distortion measures with the number of correctly estimated
note-combinations of the self-recorded audio data

Distortion   Notes
measure      Both correct   One correct
d_IS         24             35
d_L1         14             33
d_L2         16             35
d_L∞         14             35
d_cL1         9             31
d_cL2        16             34
d_cL∞         0              7

Fig. 1 Melody of Halleluja



4.5 Halleluja 

In this section we present results for a real piece of music from the self-recorded data.
Figure 1 shows the first four bars of a clerical canon called Halleluja. The
melody originates from nineteenth-century England. The canon is played
by flute and violin. For each of the 500 time intervals the starting values for the
MCMC computation are estimated with the alphabet. The Itakura-Saito distance d_IS
is applied as the distortion measure.

In Fig. 2 two melody plots are displayed (Weihs & Ligges, 2006). The left
melody plot represents the results of the alphabet, the right melody plot shows the
results of the MCMC computation. The grey boxes highlight the true tone pitches,
the black dots show the estimations of the pitch. The line at the bottom of Fig. 2
displays the energy of the original audio data. Small values of the energy indicate
silence in the data.

Furthermore, Fig. 3 shows the results after quantization, i.e., the classification
of the time intervals into notes in bars. After this step, the results can be transcribed
into musical notation. Again, the grey boxes highlight the true melody. The diamonds
display the upper voice, which is played by the flute; the boxes represent the lower
voice, played by the violin.

In the quantization plot it can be seen that the upper voice of the flute is estimated
correctly in all cases except those at the end of the second and the fourth bar.
Possible reasons are breathing pauses and notes that are played too long or too
short. The lower voice of the violin is estimated well. There are some misclassifications
where the quantization assigns the notes to the upper voice. Only one
note, in the third bar, is estimated completely incorrectly.

Fig. 2 Melody plot of Halleluja, results of alphabet (left) and MCMC computation (right)

Fig. 3 Quantization plot of Halleluja



4.6 Instrument Tracking 

Finally it will be examined whether the alphabet is adequate for instrument tracking 
as well. For the analysis the same data as in Sect. 4.4 were used. 

The results of the instrument tracking (cf. Table 3) are not as promising as the
results for the pitch tracking (cf. Table 2). The maximum number of correct estimations
of both instruments is 11. In 18 to 29 of the 35 cases one instrument was
recognised correctly. The Itakura-Saito distortion is not the best distortion measure.
All distortions based on the L_p-norms and d_cL1 lead to better results; the results with
d_L2 are the best of the seven distortion measures.

Table 3 Distortion measures (instruments)

Distortion   Instruments (including multiples)
measure      Both correct   One correct
d_IS          5             26
d_L1          9             27
d_L2         11             29
d_L∞          6             27
d_cL1        10             29
d_cL2         6             27
d_cL∞         0             18

5 Conclusion 

In this paper a pitch tracking model for polyphonic musical time series data has been
introduced. The unknown parameters are estimated with an MCMC algorithm as a
stochastic optimization procedure. Because of the unfavorable results of a first study
with polyphonic data, the polyphonic model was extended and a preprocessing step
was implemented. The application of an alphabet of notes leads to promising results.
The comparison of different distortion measures shows that no single measure is
adequate for both pitch and instrument tracking. The combination of the preprocessing
and the MCMC algorithm is even more encouraging for pitch tracking.

Future work will focus on instrument tracking.



Hedge Funds and Asset Allocation:

Investor Confidence, Diversification Benefits, 
and a Change in Investment Style Composition 



Wolfgang Bessler and Julian Holler 



Abstract Based on the belief that hedge funds are able to generate positive risk- 
adjusted returns (alpha) and diversification benefits in a portfolio context, many 
investors have included hedge funds in their asset allocation in order to optimize the 
risk-return trade-off of their investments. We provide evidence that more optimistic 
prior beliefs about expected risk-adjusted returns (alpha) lead to higher allocations 
into hedge funds. It appears, however, that history may not be the best guide for 
future fund performance and that the diversification benefits have declined over 
time. One reason for the lower risk-adjusted returns is a capacity effect in that pre- 
viously exceptional hedge fund returns caused higher inflows to these funds and 
consequently a competition for alpha among investors. In our empirical analysis we 
provide additional evidence of other explanations for decreasing hedge fund bene- 
fits such as an increase in correlations with other asset classes and changes in the 
style composition of hedge funds. 

Keywords Alternative investments • Asset allocation • Hedge funds. 



1 Introduction 



During the last decade the interest in alternative asset classes has dramatically 
increased. This trend is based on investors’ beliefs that hedge funds and other 
alternative investments either outperform traditional asset classes or offer attractive 
diversification benefits (Bessler & Drobetz, 2008). In fact, there is empirical evi- 
dence that hedge fund managers in most style categories added value to investors’ 
portfolios by outperforming other asset classes on a risk-adjusted basis. Hedge funds 
have generated these exceptional returns so far by using a wide range of different
trading strategies and consequently have profited from these perceived opportunities
by attracting significant capital inflows. Moreover, this outperformance or alpha
seemed to be persistent, indicating that some managers really possessed superior
information or investment skills. Obviously, there were limits to arbitrage as more 
investors became aware of these profitable opportunities. Consequently, the level of 
alpha has been declining over time which is often explained by a capacity effect and 
limited profit opportunities in small market segments. 

Even if nowadays hedge funds hardly outperform other asset classes, they still 
may be an attractive investment when viewed in a portfolio context because they 
may add value to investors’ portfolios by offering substantial diversification ben- 
efits. This result is due to the fact that their trading strategies often exhibit low 
correlations with conventional asset classes such as stocks and bonds during normal 
market conditions. However, there might exist a similar capacity effect leading to a 
decline in diversification benefits as more capital chasing similar trading opportu- 
nities makes it more difficult for hedge fund managers to construct portfolios that 
are truly independent of other risk factors. Moreover, the hedge fund industry has 
also been characterized by significant shifts in capital flows between its different 
segments or strategies. In particular, these shifts appear to have occurred in the 
aftermath of the stock market crash in 2000 and have induced further changes in 
the economic and statistical properties of hedge fund returns and of hedge funds as 
an asset class. 

The objective of this study is to investigate the implications that the historical 
performance, the growth of the hedge fund industry and the shift in capital flows 
between different strategies had on the investment characteristics of hedge funds.
After a review of the literature in Sect. 2 and a description of the dataset in Sect. 3, 
we empirically investigate three related questions in Sect. 4. We first analyze the 
impact that the investors’ forecasted alpha or optimism had on the asset allocation 
decision (Sect. 4.1) and then analyze possible market induced changes in the hedge 
fund industry. These are the evolution of diversification benefits over time (Sect. 4.2) 
and possible shifts in the style composition of hedge funds (Sect. 4.3). 



2 Literature Review 

In general, investors allocate capital to alternative investments in order to capture 
either abnormal returns (alpha) relative to traditional asset classes or additional 
diversification benefits. Usually, hedge funds claim to offer these superior risk- 
return trade-offs and low correlations with other asset classes due to their superior 
analytical and trading skills. This has resulted in a sharp increase in assets under 
management over the last couple of years. Indeed, a number of studies have sup- 
ported these claims and provided empirical evidence that hedge fund managers 
appear to possess superior skills and persistently generate positive alphas for their 
investors (Kosowski, Naik, & Teo, 2007). However, attracting substantial inflows 
may cause a negative impact on the ability to deliver positive alphas because many
strategies operate in niche markets in which only limited trading opportunities are
available. In fact, Naik, Ramadorai, and Stromqvist (2007) and Khandani and Lo 
(2008) both find evidence of a capacity effect indicating that hedge funds’ alphas 
deteriorate subsequent to a superior performance because of large inflows at the 
individual strategy level. 

At the same time, hedge fund investments might provide investors with substan- 
tial diversification benefits by constructing portfolios whose systematic component 
is driven by new risk factors which exhibit low correlations with traditional asset 
classes such as equities and bonds (Bessler, Drobetz, & Holler, 2007). However, the 
potential diversification benefits vary substantially across different investment styles 
(Bessler, Drobetz, & Henn, 2005). At the one end of the spectrum are relative value 
strategies that attempt to exploit pricing differences between related securities and 
effectively earn returns by providing liquidity to capital markets. Hence, they exhibit 
low correlations with conventional risk factors and therefore should provide sub- 
stantial diversification benefits. However, their correlations exhibit a phase-locking 
behavior and increase significantly during market crashes due to their tail risk expo- 
sure. At the other end of the spectrum are directional strategies which take bets 
on the future movements of equity, fixed income, and currency markets in order 
to exploit market trends and investor sentiment. Consequently, these strategies offer 
only limited diversification potential to investors. Finally, event-driven strategies are 
located in between these two extremes in terms of diversification benefits because 
their positions on corporate events naturally contain some equity market exposures. 
Thus, it seems interesting to analyze possible changes in diversification benefits and 
changes in the asset allocation with respect to different hedge fund strategies. In 
addition it is of interest whether higher expected risk-adjusted returns (alphas) have 
an impact on the asset allocation decision. In particular, several authors adopt a 
Bayesian approach and provide evidence that asset allocation decisions into equity 
and bond mutual funds as well as hedge funds are driven by investors’ expectations 
about future returns (Cvitanic, Lazrak, Martellini, & Zapatero, 2003). 



3 Data and Descriptive Statistics 

In the empirical analysis we include various asset classes for the portfolio analysis
and various hedge fund indices for the style analysis. We employ monthly return
data from the CSFB Tremont Hedge Fund Index from its inception in December
1993 until May 2008. Real estate returns are calculated from monthly returns on
the NAREIT total return index, which reflects the performance of all publicly traded
REITs in the US capital market. Furthermore, we include the GSCI total return
index to represent the returns of an investment in commodities. The protective-put
strategy captures the volatility exposure in the returns of an option-writing strategy
constructed from the DAXplus protective-put index. Finally, we select the MSCI
Europe and the REX indices to capture the performance of equities and bonds,
respectively, as the most popular traditional asset classes. The descriptive statistics
characterizing these asset classes are provided in Tables 1 and 2.

Table 1 Descriptive statistics

Asset class       Mean     Std. dev   Skew      Kurt      Sharpe    JB-test
Panel A: Univariate statistics 1994-2008
REX               0.0005   0.0091     -0.2804    2.7833   -0.2660      2.5096
MSCI Europe       0.0078   0.0424     -0.5217    4.0326    0.1151     15.4436
NAREIT            0.0103   0.0396     -0.6308    4.2283    0.1863     22.2202
GSCI              0.0095   0.0476     -0.1723    2.8851    0.1382      0.9461
Protective put    0.0024   0.0262      4.0483   34.1291   -0.0198   7415.50
CSFB Tremont      0.0087   0.0216      0.1008    5.4010    0.2676     41.6073
Panel B: Univariate statistics 1996-2001
REX               0.0009   0.0081     -0.4458    3.3293   -0.2537      1.9925
MSCI Europe       0.0081   0.0445     -0.5421    3.7505    0.1162      3.6665
NAREIT            0.0077   0.0385      0.1745    3.7180    0.1236      1.0768
GSCI              0.0062   0.0481     -0.0559    2.4335    0.0676      0.9968
Protective put    0.0016   0.0219      0.1981    2.9936   -0.0626      0.4002
CSFB Tremont      0.0112   0.0293     -0.1386    3.8606    0.2818      1.3803
Panel C: Univariate statistics 2003-2008
REX               0.0000   0.0082      0.0703    2.9210   -0.3019      0.1119
MSCI Europe       0.0109   0.0328     -0.7987    4.8573    0.2602      9.5731
NAREIT            0.0080   0.0509     -0.9796    3.9107    0.1107      8.1697
GSCI              0.0128   0.0546     -0.3132    2.9427    0.1914      0.7966
Protective put    0.0028   0.0114      0.5809    3.2319    0.0394      2.5475
CSFB Tremont      0.0076   0.0131     -0.2559    2.4144    0.3981      1.3334

Table 2 Correlation matrix 1994-2008

Asset class   REX   MSCI      NAREIT    GSCI      Prot. put   CSFB
REX           1     -0.1527   -0.0050    0.0060    0.1203     0.0594
MSCI                 1         0.3107   -0.0856   -0.0864     0.4810
NAREIT                         1        -0.0293   -0.0041     0.2216
GSCI                                     1        -0.0897     0.0474
Prot. put                                          1          0.1652
CSFB                                                          1

Over all periods, hedge funds have the highest Sharpe ratio, suggesting that historically
this was the best-performing asset class on a stand-alone basis. However,
a high Sharpe ratio is often caused by significant deviations of the return distribution
from normality and tends to go hand in hand with more mass in the tails of the
return distribution. Jarque-Bera tests support this notion for our hedge fund data as
well as for equity and real estate returns.
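
The statistics in Table 1 can be sketched as follows. The formulas are the standard definitions of the Sharpe ratio and the Jarque-Bera statistic. The monthly risk-free rate of 0.00292 is not stated in the text; it is the value implied by the rounded Panel A means, standard deviations and Sharpe ratios, and should be read as an inference.

```python
import math


def sample_moments(returns):
    """Mean, standard deviation, skewness and kurtosis (population moments)."""
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2


def sharpe_ratio(mean, std, risk_free):
    """Excess return per unit of standard deviation."""
    return (mean - risk_free) / std


def jarque_bera(n, skew, kurt):
    """Jarque-Bera statistic; asymptotically chi-square with 2 degrees of freedom."""
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


# Panel A values for the hedge fund index, with the inferred risk-free rate.
hedge_fund_sharpe = sharpe_ratio(0.0087, 0.0216, 0.00292)
```

At the 5% level the chi-square critical value with two degrees of freedom is about 5.99, which the Panel A statistics for equities, real estate and hedge funds exceed.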

Moreover, hedge fund returns are significantly correlated with stock returns over
the entire sample period from 1994 to 2008 (ρ = 0.4810). Furthermore, while the
correlation was only ρ = 0.6067 during the first up-market between 1996 and 2001,
it increased to ρ = 0.8306 during the second up-market period between 2003 and
2008. This indicates that hedge funds were able to reduce their market exposure
somewhat during down-markets (ρ = 0.3711). Nevertheless, it is interesting to note
that the lowest exposure occurred shortly after the inception of the CSFB Tremont
index, at ρ = 0.1129 (1994-1996).

Consequently, the diversification benefits of hedge funds have decreased over
time and are currently smaller than usually expected. Nevertheless, it seems fair
to conclude that at least in the past there existed some diversification benefits, but
given the upward trend in the correlation between stocks and hedge funds over time
as revealed in our empirical analysis, these benefits have eventually vanished in
recent years.



4 Portfolio Benefits and Capital Flows 

Investments into hedge funds and other alternative asset classes are based on the
belief that these new asset classes will generate alpha and offer additional diversification
opportunities. It needs to be recognized, however, that substantial allocations
into hedge funds by many investors might eventually decrease these benefits because
there are only limited trading and profit opportunities available in capital markets.
In order to evaluate the potential magnitude of this effect, we relate investors’
expectations of alpha to their allocation decisions in hedge funds in Sect. 4.1. In
Sect. 4.2 we analyze hedge funds’ diversification benefits over time and in Sect. 4.3
we investigate possible shifts in the style composition of the hedge fund industry.



4.1 Expected Alpha and Allocation into Hedge Funds 

The purpose of our first analysis is to highlight the impact of investors’ return expectations
on asset allocation. For this we combine forecasts of the first two moments
of asset returns from a simple Bayesian regression framework with mean-variance
analysis subject to short-sale constraints. More precisely, we regress the returns of the
CSFB Tremont Hedge Fund index on the returns of the other asset classes using an
informative natural-conjugate Normal-Gamma prior on alpha and beta.

This allows us to establish the investor’s allocation into hedge funds for different
expectations about the mean of alpha. Moreover, we make this prior non-informative
about beta and the volatility of alpha by setting the prior mean of beta and the
prior volatility of alpha equal to their OLS estimates. Furthermore, the investor puts
equal weight on sample information and his own expectations in our set-up.
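
The displayed regression is not recoverable from the text. In a standard textbook notation, which is an assumption here and need not match the authors' parametrisation, the model, the natural-conjugate prior and the resulting posterior mean of the coefficients can be written as:

```latex
\begin{align}
r_{HF,t} &= \alpha + \sum_{k=1}^{K} \beta_k\, r_{k,t} + \varepsilon_t,
  \qquad \varepsilon_t \sim N(0, h^{-1}), \\
(\alpha, \beta_1, \dots, \beta_K)' \mid h &\sim N\!\left(b_0,\; h^{-1} V_0\right),
  \qquad h \sim \mathrm{Gamma}(s_0^{-2}, \nu_0), \\
\bar{b} &= \left(V_0^{-1} + X'X\right)^{-1}\left(V_0^{-1} b_0 + X'X\,\hat{b}_{OLS}\right).
\end{align}
```

If the prior precision is set equal to the sample precision, $V_0^{-1} = X'X$, the posterior mean reduces to $(b_0 + \hat{b}_{OLS})/2$. This is one way to read the statement that the investor puts equal weight on sample information and his own expectations: a prior alpha of zero would then still leave a posterior alpha of half the OLS estimate.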

In Fig. 1 we present efficient frontiers and portfolio weights calculated with all
available asset classes for three different expectations on hedge fund alpha. In the
case of an expected alpha of zero (“Alpha 0”), the investor is sceptical and does
not believe that hedge funds can generate abnormal returns (alpha). In the case of
“Alpha 1”, the investor’s expectations are equal to the sample OLS estimate for
alpha. Finally, “Alpha 2” implies that the investor is quite optimistic and expects
hedge funds to generate twice as much alpha as indicated by the historical
data.

Fig. 1 Expectations on alpha and asset allocation

As expected, the investor’s allocation to hedge funds for all levels of risk aversion 
is highly sensitive to the expected alpha. If the investor expects that alpha will be 
equal to the sample estimate for alpha, he will predominantly reallocate capital from 
bonds to hedge funds with decreasing risk aversion. Only for very low levels of risk 
aversion, investors move from hedge funds into real estate due to higher expected 
returns on real estate - although at a higher risk in terms of standard deviation. 
These results become even more pronounced when the investor hopes for a higher 
alpha than warranted by sample evidence (“Alpha 2”). However, even if the investor 
does not believe that hedge funds are able to generate alpha, the sample evidence on 
alpha leads him to make a small allocation to hedge funds.

In contrast to Cvitanic et al. (2003), we find that efficient frontiers are not very
sensitive to the expected volatility of alpha. This is due to the fact that we use 
a larger number of asset classes which decreases the influence of each individual 
asset class while alpha is uncorrelated with the other asset classes. However, hedge 
fund returns are non-normally distributed. This raises some doubts whether mean- 
variance analysis is applicable when constructing portfolios involving hedge funds 
given the fact that investors might exhibit preferences for certain combinations of 
higher moments (Bessler et al., 2005). Interestingly, Jarque-Bera tests do not reject
the null hypothesis of normally distributed portfolio returns in our sample as long as 
the proportion of hedge funds in the portfolios remains below approximately 35%. 
Furthermore, time variation in the conditional moments and co-moments of hedge 
funds’ returns might generate intertemporal hedging demand leading to significant 
deviations between the optimal dynamic and the optimal static mean-variance- 
portfolio. However, so far there is only limited research on the time-variation in 
the conditional moments and co-moments of hedge fund returns. 



4.2 Reduction of Diversification Benefits Over Time 

In order to investigate the changes in diversification benefits of hedge fund invest- 
ments over time, we analyze changes in the correlations between hedge fund returns 
and returns on other asset classes using simple variance decompositions with a 
rolling window of 36 months: 



$$\text{Share}(\text{Factor } n) = \frac{\beta_n^{2}\, V(f_n)}{V(r_{HF})}$$



In order to address multi-collinearity problems, we orthogonalize the time series of 
all asset returns with respect to the stock market. 
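
A rolling variance decomposition of this kind can be sketched as below. The sketch assumes that the share of a factor is its squared slope times the factor variance, divided by the variance of the hedge fund return, and it uses single-factor slopes, which is exact only if the factors are mutually uncorrelated within the window. All function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Sample covariance (n - 1 in the denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def orthogonalize(series, market):
    """Residuals of an OLS regression of `series` on `market` with intercept."""
    slope = cov(series, market) / cov(market, market)
    intercept = mean(series) - slope * mean(market)
    return [s - intercept - slope * m for s, m in zip(series, market)]


def variance_share(fund, factor):
    """beta^2 * Var(factor) / Var(fund), with beta from a single-factor regression."""
    beta = cov(fund, factor) / cov(factor, factor)
    return beta ** 2 * cov(factor, factor) / cov(fund, fund)


def rolling_shares(fund, factor, window=36):
    """Variance share of one factor for each rolling window."""
    return [
        variance_share(fund[i:i + window], factor[i:i + window])
        for i in range(len(fund) - window + 1)
    ]
```

With a single factor the share equals the squared correlation between the fund and the factor, so a rising stock market share corresponds directly to a rising correlation with equities.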

The results - which reveal two interesting findings - are presented in Fig. 2. 
First, the correlation of the aggregate hedge fund market with the stock market 
has increased over the sample period with a minor reversal in 2001. Second, from 
the other factors only interest rate risk seems to have some impact on hedge fund 
returns during specific time intervals. These findings indicate that the diversification 
benefits offered by hedge funds have decreased dramatically over time. Although 
investors who took early positions in hedge funds such as US university endowments 
were able to capture substantial portfolio benefits from hedge fund investments, 
changes in the hedge fund industry might have led to a significant deterioration 
in investment opportunities. This reduction in diversification benefits might be the 
result of higher capital inflows into hedge funds but also of a growing institution- 
alization of the hedge fund industry. This could have reduced the flexibility of 
managers to react quickly to opportunities in capital markets.

Fig. 2 Factor structure over time



4.3 Structural Breaks 

In addition to the increase in hedge fund assets under management, there has
also been a significant shift in investors’ asset allocations and in capital flows
between the various hedge fund strategies. In order to investigate possible shifts in
the style composition of the hedge fund universe, we analyze the changes that have
occurred in the asset allocation decision. One way to test for these changes is to
employ the style analysis developed by Sharpe. For this, we conduct simple style
regressions

$$r_{HF} = \alpha + \beta_1 r_{\text{Style } 1} + \cdots + \beta_m r_{\text{Style } m} + \varepsilon$$

where the different styles are approximated by the style indices provided by CSFB
Tremont.

Table 3 reports the results from the style regressions for the 5-year time-periods 
from April 1996 to March 2001 and from April 2003 to March 2008. The results are 
in line with our expectations. In particular, during the first time period the aggre- 
gate hedge fund market was dominated by global macro, fixed income arbitrage, 
and short-selling which are the only styles with statistically significant coefficients. 
During the second time period, the hedge fund industry has become much more 
diversified over the different strategies with the majority of styles having significant 
coefficients. Moreover, Chow- and Wald-tests indicate the presence of a structural 
break around the end of the dotcom-bubble at the 1%-level. It is interesting to note 
that the increasing importance of non-directional strategies goes hand in hand with 
an increasing dependence on other asset markets (see Sect. 4.2), although these 
strategies should exhibit low correlations with other asset classes. 
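
For reference, the standard Chow statistic for a structural break can be sketched as follows. How the authors set up their test (which observations are pooled, which regressors enter) is not stated, so the function below only illustrates the textbook formula; the residual sums of squares in the example are made-up numbers.

```python
def chow_statistic(rss_pooled, rss_1, rss_2, n_1, n_2, k):
    """Standard Chow F statistic for equality of k coefficients across two samples."""
    rss_split = rss_1 + rss_2
    numerator = (rss_pooled - rss_split) / k
    denominator = rss_split / (n_1 + n_2 - 2 * k)
    return numerator / denominator


# Two 5-year windows of monthly data and 13 coefficients (alpha plus 12 styles).
# The residual sums of squares below are illustrative, not taken from the paper.
f_stat = chow_statistic(rss_pooled=10.0, rss_1=4.0, rss_2=4.0, n_1=60, n_2=60, k=13)
```

Under the null hypothesis of no break the statistic follows an F distribution with k and n_1 + n_2 - 2k degrees of freedom, here 13 and 94.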

Table 3 Style decomposition

Asset class            4/96-3/01   4/03-3/08
Alpha                  -0.0001     -0.0003
CB-arbitrage            0.0340      0.0273
Short selling          -0.0532      0.0108
Emerging markets       -0.0226      0.0425
Equity mark. neutral   -0.3365      0.0541
Event driven            0.1839      0.5878
Distressed              0.2374     -0.1684
Multi-strategy         -0.0392     -0.1849
Risk arbitrage          0.1201     -0.0198
Fixed income arb.       0.3623      0.0506
Global macro            0.3272      0.1190
Long-short equity       0.0036      0.3047
Managed futures         0.0232      0.1308
R²                      0.9983      0.9983

Our results are supported by an analysis of the changes in hedge fund return
volatility over time. Figure 3 shows the volatility of hedge fund returns using a
36-month rolling window. This figure reveals that a significant reduction in volatility
occurred at the aggregate hedge fund level at about the same time as
the stock market crash in 2000. One possible interpretation of this finding is that an
increase in risk aversion led investors to reallocate capital from rather risky hedge
fund strategies such as global macro to strategies with presumably lower risk such
as relative value strategies.
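
The rolling volatility can be sketched in a few lines. The function name and the synthetic return series are assumptions for illustration; only the 36-month window is taken from the text.

```python
import statistics


def rolling_volatility(returns, window=36):
    """Sample standard deviation of returns over each rolling window."""
    return [
        statistics.stdev(returns[i:i + window])
        for i in range(len(returns) - window + 1)
    ]


# Illustrative monthly returns: a volatile first half followed by a calm second half.
returns = [0.03, -0.03] * 24 + [0.01, -0.01] * 24
vol = rolling_volatility(returns)
```

With a 36-month window a sudden drop in volatility shows up gradually, because each estimate still contains months from before the change until the window has moved past it.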



5 Conclusion 

Since the beginning of this decade investors have started to allocate large sums 
of capital into hedge funds in expectation of positive abnormal returns and sub- 
stantial diversification benefits. We provide evidence that a more optimistic prior 
belief about expected risk-adjusted returns (alpha) leads to a higher allocation into 
hedge funds. However, the stock market crash in 2000 had a significant impact on 
volatility, on asset allocation, and on the investment properties of hedge funds as an
asset class. In particular, the correlation with the aggregate stock market increased
substantially over the time period from 1994 to 2008, with a minor decline in 
2000. Moreover, the style composition of the hedge fund universe also changed 
dramatically over the same time period from more risky to presumably less risky 
strategies. 

These results have important implications for investors allocating some of their 
capital to hedge funds. Most importantly, history may not be the best guide for 
future performance. In particular, given this instability in investment characteris- 
tics investors can hardly assess ex-ante whether hedge funds will really provide the 
desired returns and diversification effects. Moreover, the variability in the invest- 
ment properties of hedge funds casts doubts on the practice of considering hedge 
funds as one single homogenous asset class in strategic asset allocation decisions.



Mixture Hidden Markov Models
in Finance Research



Jose G. Dias, Jeroen K. Vermunt, and Sofia Ramos



Abstract Finite mixture models have proven to be a powerful framework when- 
ever unobserved heterogeneity cannot be ignored. We introduce in finance research 
the Mixture Hidden Markov Model (MHMM) that takes into account time and 
space heterogeneity simultaneously. This approach is flexible in the sense that it 
can deal with the specific features of financial time series data, such as asymme- 
try, kurtosis, and unobserved heterogeneity. This methodology is applied to model 
simultaneously 12 time series of Asian stock markets indexes. Because we selected 
a heterogeneous sample of countries including both developed and emerging coun- 
tries, we expect that heterogeneity in market returns due to country idiosyncrasies 
will show up in the results. The best fitting model was the one with two clusters at 
country level with different dynamics between the two regimes. 

Keywords Finite mixture model • Hidden Markov model • Market volatility •
Model-based clustering • Stock indexes. 



1 Introduction 

Finite mixture modeling has been a powerful tool for capturing unobserved hetero- 
geneity in a wide range of social and behavioral science data (see, for example, 
McLachlan & Peel, 2000 or Dias & Vermunt, 2007). Modeling the dynamics of 
stock market returns has been an important challenge in modern financial economet- 
rics. The statistics and dynamics of correctly specified distributions provide more 
accurate and detailed input for financial asset pricing and risk management. For 
example, investors buy or sell securities according to their expectation of the market 
state. In addition, portfolio risk reduction might be achieved by procedures that 
take into account the synchronization of market regimes. We introduce a specific
finite model for financial time series analysis that takes into account unobserved
heterogeneity across space and time. Here, this methodology is used to model the 
dynamics of the returns of 12 stock market indexes. 

As illustrated below, the proposed approach is flexible in the sense that it can deal 
with the specific features of financial time series data, such as asymmetry, kurtosis 
and unobserved heterogeneity. Having selected a heterogeneous sample of coun- 
tries including both developed and emerging countries from Asia, we expect that 
heterogeneity in market returns due to country specificities will show up in the 
results. For instance, emerging market return distributions show larger deviations 
from normality; i.e., are more skewed and have fatter tails (Harvey, 1995). 

The paper is organized as follows: Sect. 2 presents the mixture hidden
Markov model (MHMM); Sect. 3 describes the 12 stock market time series that are used
throughout this paper; Sect. 4 reports the MHMM estimates. The paper concludes
with a summary of the main findings.



2 The Mixture Hidden Markov Model 

We model simultaneously the time series of $n$ stock market returns. Let $y_{it}$
represent the response of observation (stock market) $i$ at time point $t$, where
$i \in \{1, \ldots, n\}$ and $t \in \{1, \ldots, T\}$. In addition to the observed
“response” variable $y_{it}$, the MHMM contains two different latent variables: a
time-constant discrete latent variable and a time-varying discrete latent variable.
The former, which is denoted by $w \in \{1, \ldots, S\}$, captures the unobserved
heterogeneity across stock markets; that is, stock markets are clustered based on
differences in their dynamics. We will refer to a model with $S$ clusters as MHMM-S.
The two-state time-varying latent variable is denoted by $z_t \in \{1, 2\}$.

Let $f(\mathbf{y}_i; \varphi)$ be the (probability) density function associated with the
index return rates of stock market $i$, where $\varphi$ is the vector of parameters in the
model. The MHMM-S defines the following parametric model for this density:

$$f(\mathbf{y}_i; \varphi) = \sum_{w=1}^{S} \sum_{z_1=1}^{2} \cdots \sum_{z_T=1}^{2} f(w)\, f(z_1 \mid w) \prod_{t=2}^{T} f(z_t \mid z_{t-1}, w) \prod_{t=1}^{T} f(y_{it} \mid z_t). \quad (1)$$

As in any mixture model, the observed data density $f(\mathbf{y}_i; \varphi)$ is obtained
by marginalizing over the latent variables. Because in our model these are discrete
variables, this simply involves the computation of a weighted average of class-specific
probability densities where the (prior) class membership probabilities or mixture
proportions serve as weights (McLachlan & Peel, 2000). We assume that within cluster $w$
the sequence $\{z_1, \ldots, z_T\}$ follows a first-order Markov chain. Moreover,
we assume that the observed return at a particular time point depends only on the
regime at this time point; i.e., conditionally on the latent state $z_t$, the response
$y_{it}$ is independent of returns at other time points, which is often referred to as the
local independence assumption. As far as the first-order Markov assumption for the latent
regime switching conditional on cluster membership $w$ is concerned, it is important
to note that this assumption is not as restrictive as one may initially think: it
clearly does not imply a first-order Markov structure for the responses $y_{it}$. The
standard hidden Markov model (HMM) (Baum, Petrie, Soules, & Weiss, 1970) is
a special case of the MHMM-S that is obtained by eliminating the time-constant
latent variable $w$ from the model, that is, by assuming that there is no unobserved
heterogeneity across countries.

The characterization of the MHMM is provided by:

• $f(w)$ is the prior probability of belonging to a particular cluster $w$, with
multinomial parameter $\pi_w = P(W = w)$.

• $f(z_1 \mid w)$ is the initial-regime probability; that is, the probability of having a
particular initial regime conditional on belonging to cluster $w$, with Bernoulli parameter
$\lambda_{kw} = P(Z_1 = k \mid W = w)$.

• $f(z_t \mid z_{t-1}, w)$ is a latent transition probability; that is, the probability of
being in a particular regime at time point $t$ conditional on the regime at time point
$t - 1$ and cluster membership; assuming a time-homogeneous transition process, we have
$p_{jkw} = P(Z_t = k \mid Z_{t-1} = j, W = w)$ as the relevant Bernoulli parameter. In
other words, within cluster $w$ one has the transition probability matrix

$$\mathbf{P}_w = \begin{pmatrix} p_{11w} & p_{12w} \\ p_{21w} & p_{22w} \end{pmatrix},$$

with $p_{12w} = 1 - p_{11w}$ and $p_{21w} = 1 - p_{22w}$. Note that the MHMM-S
allows each cluster to have its own transition or regime-switching dynamics,
whereas in a standard HMM it is assumed that all cases have the same transition
probabilities.

• $f(y_{it} \mid z_t)$, the probability density of having a particular observed stock
return in index $i$ at time point $t$, conditional on the regime occupied at time point
$t$, is assumed to have the form of a univariate normal (or Gaussian) density function.
This distribution is characterized by the parameter vector $\theta_k = (\mu_k, \sigma_k^2)$
containing the mean ($\mu_k$) and variance ($\sigma_k^2$) for regime $k$. Note that these
parameters are assumed invariant across clusters, an assumption that may, however, be relaxed.

Since $f(\mathbf{y}_i; \varphi)$, defined by (1), is a mixture of densities across
clusters $w$ and regimes, it defines a flexible Gaussian mixture model that can accommodate
deviations from normality in terms of skewness and kurtosis. The two-state MHMM-S
has $4S + 3$ free parameters to be estimated, including $S - 1$ class sizes, $S$
initial-regime probabilities, $2S$ transition probabilities, two conditional means, and two
conditional variances.
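To make the marginalization in (1) concrete, the following sketch evaluates the density of one short return series in two ways: literally, by summing over every cluster and every regime path, and by the forward recursion that exploits the Markov structure. It is a minimal illustration, not the authors' implementation; all function names and all parameter values below are made up for the example.

```python
import math
from itertools import product


def normal_pdf(y, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def density_brute_force(y, pi, lam, trans, means, variances):
    """Evaluate (1) literally: sum over clusters w and all 2^T regime paths."""
    T = len(y)
    total = 0.0
    for w in range(len(pi)):
        for path in product(range(2), repeat=T):
            p = pi[w] * lam[w][path[0]]
            for t in range(1, T):
                p *= trans[w][path[t - 1]][path[t]]
            for t in range(T):
                p *= normal_pdf(y[t], means[path[t]], variances[path[t]])
            total += p
    return total


def density_forward(y, pi, lam, trans, means, variances):
    """Same density via the forward recursion; cost grows linearly in T."""
    total = 0.0
    for w in range(len(pi)):
        # alpha[k] = joint density of y_1..y_t and z_t = k, given cluster w
        alpha = [lam[w][k] * normal_pdf(y[0], means[k], variances[k]) for k in range(2)]
        for t in range(1, len(y)):
            alpha = [
                sum(alpha[j] * trans[w][j][k] for j in range(2))
                * normal_pdf(y[t], means[k], variances[k])
                for k in range(2)
            ]
        total += pi[w] * sum(alpha)
    return total


# Illustrative (made-up) parameters for S = 2 clusters and a short series.
y_demo = [0.5, -2.1, 0.3, 1.2, -0.4]
pi_demo = [0.5, 0.5]
lam_demo = [[0.3, 0.7], [0.2, 0.8]]
trans_demo = [[[0.90, 0.10], [0.05, 0.95]], [[0.93, 0.07], [0.01, 0.99]]]
means_demo = [-0.10, 0.06]
variances_demo = [7.35, 0.88]

brute = density_brute_force(y_demo, pi_demo, lam_demo, trans_demo, means_demo, variances_demo)
forward = density_forward(y_demo, pi_demo, lam_demo, trans_demo, means_demo, variances_demo)
print(brute, forward)
```

Both routes give the same number for this toy series, but only the recursion remains feasible for long series. For series with thousands of time points the unscaled recursion would underflow, so a practical version would rescale `alpha` at each step or work on the log scale.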

Maximum likelihood (ML) estimation of the parameters of the MHMM-S involves
maximizing the log-likelihood function
$\ell(\varphi; \mathbf{y}) = \sum_{i=1}^{n} \log f(\mathbf{y}_i; \varphi)$, a problem
that can be solved by means of the Expectation-Maximization (EM) algorithm
(Dempster, Laird, & Rubin, 1977). The E step computes the joint conditional
distribution of the $T + 1$ latent variables given the data and the current provisional
estimates of the model parameters. In the M step, standard complete data ML
methods are used to update the unknown model parameters using an expanded data
matrix with the estimated densities of the latent variables as weights. Because the EM
algorithm in this form requires computing and storing $S \times 2^T$ entries in the E step,
it is impractical or even impossible to apply with more than a few time
points. However, for hidden Markov models, a special variant of the EM algorithm
has been proposed that is usually referred to as the forward-backward or
Baum-Welch algorithm (Baum et al., 1970). The Baum-Welch algorithm circumvents
the computation of this joint posterior distribution by making use of the conditional
independencies implied by the model. As shown by Vermunt, Tran, and Magidson
(2008), the Baum-Welch algorithm for HMMs can easily be generalized to
mixtures of HMMs.

An important modeling issue is the setting of $S$, the number of clusters needed
to capture the unobserved heterogeneity across stock markets. The selection of $S$ is
typically based on information statistics such as the Bayesian Information Criterion
(BIC) (Schwarz, 1978). In our application we select the $S$ that minimizes the BIC value,
defined as:

$$BIC_S = -2\,\ell_S(\hat{\varphi}; \mathbf{y}) + N_S \log n, \quad (2)$$

where $N_S$ is the number of free parameters of the model and $n$ is the sample size.
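As a quick arithmetic check of (2), the sketch below recomputes the BIC from the log-likelihood and parameter count reported in Sect. 4 for the two-cluster model. Taking the sample size to be the number of stock markets (12) is our assumption; it is the value that reproduces the reported BIC of 140,539.6 up to rounding. The function names are illustrative.

```python
import math


def n_free_parameters(S):
    """(S - 1) class sizes + S initial-regime + 2S transition + 2 means + 2 variances."""
    return (S - 1) + S + 2 * S + 2 + 2


def bic(log_likelihood, n_params, sample_size):
    """BIC as in (2): -2 * log-likelihood + N_S * log(n), natural logarithm."""
    return -2.0 * log_likelihood + n_params * math.log(sample_size)


# Log-likelihood reported in Sect. 4 for S = 2; sample size 12 is an assumption.
bic_2 = bic(-70256.1081, n_free_parameters(2), 12)
print(bic_2)
```

The parameter count returned for S = 2 is 11, matching the reported number of free parameters.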



3 Data Set 

The data set used in this article consists of daily closing prices from 4 July 1994 to 27
September 2007 for 12 Asian stock market indexes, drawn from the Datastream database
and listed in Table 1. The series are expressed in US dollars. In total, we have 3,454
end-of-the-day observations per country. Let $P_{it}$ be the observed daily closing price
of market $i$ on day $t$, $i = 1, \ldots, n$ and $t = 0, \ldots, T$. The daily rates of
return are defined as the percentage rate of return
$y_{it} = 100 \times \log(P_{it}/P_{i,t-1})$, $t = 1, \ldots, T$,
with $T = 3{,}454$.
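The return definition above can be sketched in a few lines. The function name and the example prices are illustrative only; the natural logarithm is assumed.

```python
import math


def log_returns_percent(prices):
    """Percentage log returns: 100 * log(P_t / P_{t-1}) for t = 1, ..., T."""
    return [100.0 * math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]


# T + 1 prices give T returns (made-up closing prices).
example_returns = log_returns_percent([100.0, 101.0, 99.99])
print(example_returns)
```

A convenient property of log returns is that they add up over time: the sum of the daily values equals the log return over the whole period.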

The sample has some appealing features as it mixes developed and emerging
markets of the Asian region. Major index providers like S&P or MSCI develop regional
indices because of the presumption that neighboring countries are economically
interrelated. For instance, neighboring countries trade more intensely with each other and,
as a result, “cycles” in one country are likely to affect its neighbors.
Therefore, one could expect some homogeneity in the behavior of such countries.
On the other hand, international stock markets are divided into developed and
emerging markets because of the distinctive features of each group. Therefore, the
methodology provides an opportunity to investigate how countries cluster in
that region and whether it is indeed the case that neighboring countries have similar
regime-switching propensities.

Table 1 provides descriptive statistics of the time series, while Fig. 1 depicts the
full time series. The sample period includes periods of market instability such as the
Asian Flu crisis of 1997, the Russian crisis of 1998, and the global stock market
downturn of 2001 following the dot-com bubble. It can be seen that both the
mean and the median return rates are close to zero and, except for the means of Japan
and Thailand, non-negative. Stock markets show, instead, very diverse patterns of
dispersion: the largest standard deviations are found in Thailand, Pakistan, China and
Malaysia and the smallest in New Zealand and Australia. Higher standard deviations are
typical for emerging markets, known for their high risk. Return rate distributions are
diverse in terms of skewness, and the kurtosis (which equals 0 for normal distributions)
shows high positive values, indicating heavier tails and more peakedness than
the normal distribution. The Jarque-Bera test rejects the null hypothesis of normality
for all 12 stock markets. Overall, these stock market features seem well suited to
be modeled using MHMMs.

Table 1 Summary statistics

                                                                        Jarque-Bera test
Stock market          Mean  Median  Std. deviation  Skewness  Kurtosis  Statistic  p-Value
Australia (AU)       0.043   0.047           1.066    -0.265     4.011      2,340    0.000
China (CH)           0.049   0.012           1.864     0.007     5.128      3,760    0.000
Hong Kong (HK)       0.033   0.009           1.506     0.001    11.402     18,620    0.000
India (IN)           0.038   0.041           1.556    -0.448     4.756      3,350    0.000
Japan (JP)          -0.003   0.000           1.362     0.109     3.114      1,390    0.000
Malaysia (MY)        0.004   0.000           1.822    -1.565    73.976    785,490    0.000
New Zealand (NZ)     0.027   0.042           1.057    -0.612     9.353     12,740    0.000
Pakistan (PK)        0.009   0.000           1.874    -0.377     6.491      6,110    0.000
Philippines (PH)     0.000   0.000           1.556     0.832    15.513     34,870    0.000
Singapore (SG)       0.020   0.047           1.263    -0.007     7.090      7,200    0.000
Taiwan (TA)          0.010   0.000           1.685    -0.145     3.176      1,450    0.000
Thailand (TH)       -0.011   0.000           2.098     0.332     8.409     10,190    0.000

Fig. 1 Time series of index rates for 12 Asian region stock markets (horizontal axis: years)
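For readers who want to see how a normality statistic of the kind reported in Table 1 is built, the sketch below computes sample skewness, excess kurtosis and the textbook Jarque-Bera statistic, JB = n/6 (skewness² + excess kurtosis²/4). This is the standard formula, which we assume is close to what the authors used; plugging the rounded Australian values from Table 1 into it gives roughly 2,356 against the reported 2,340, so their exact estimator presumably differs slightly. Function names are illustrative.

```python
def jb_from_moments(n, skewness, excess_kurtosis):
    """Textbook Jarque-Bera statistic from sample size, skewness and excess kurtosis."""
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)


def jarque_bera(x):
    """Return (skewness, excess kurtosis, JB statistic) using biased moment estimators."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis, jb_from_moments(n, skewness, excess_kurtosis)


# Tiny symmetric toy sample: zero skewness, lighter tails than the normal.
toy_skew, toy_kurt, toy_jb = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])

# Rounded Table 1 values for Australia (n = 3,454 daily returns).
australia_jb = jb_from_moments(3454, -0.265, 4.011)
print(toy_jb, australia_jb)
```

Under normality the statistic is asymptotically chi-squared with two degrees of freedom, so values in the thousands, as in Table 1, correspond to p-values that are zero to three decimals.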



4 Results 

This section reports the results obtained when applying the MHMM-S described
above to these 12 stock markets. We estimated models with different
numbers of clusters ($S = 1, \ldots, 8$), using 300 different sets of starting values
for the parameters of each model to avoid local maxima. The model with
two clusters ($S = 2$) yielded the lowest BIC value
($\ell_2(\hat{\varphi}; \mathbf{y}) = -70{,}256.1081$, $N_2 = 11$ and $BIC_2 = 140{,}539.6$).

Table 2 summarizes the distribution of stock markets across clusters, which gives
the size of each cluster. The prior class membership probabilities
show that both clusters have virtually the same size. From the posterior class membership
probabilities, i.e., the probabilities of belonging to each of the clusters conditional on
the observed data (Table 2), we found six countries assigned to cluster 1 (China,
India, Japan, Pakistan, Taiwan, and Thailand) and six countries assigned to
cluster 2 (Australia, Hong Kong, Malaysia, New Zealand, Philippines, and Singapore).
Notice that the modal allocation into classes is precise: the posterior probability of
the most likely cluster is always one or very close to one. Notice also that cluster 1
contains mostly emerging markets, with the exception of Japan, while cluster 2 is
composed mainly of developed markets, with the exception of Malaysia and the
Philippines. Combining the classification information with the descriptive statistics in
Table 1, cluster 1 tends to contain countries with higher volatility (except Japan) and
cluster 2 aggregates countries with lower volatility (except, mainly, Malaysia). As will
become clear, however, the main distinction between these two groups has to do with
other important factors.

Table 2 Estimated prior and posterior probabilities, and modal clusters for the MHMM-2

Stock market               Cluster 1  Cluster 2  Modal cluster
Prior probabilities            0.501      0.499
Posterior probabilities
  Australia (AU)               0.000      1.000  2
  China (CH)                   1.000      0.000  1
  Hong Kong (HK)               0.000      1.000  2
  India (IN)                   1.000      0.000  1
  Japan (JP)                   0.992      0.008  1
  Malaysia (MY)                0.000      1.000  2
  New Zealand (NZ)             0.000      1.000  2
  Pakistan (PK)                1.000      0.000  1
  Philippines (PH)             0.019      0.981  2
  Singapore (SG)               0.000      1.000  2
  Taiwan (TA)                  1.000      0.000  1
  Thailand (TH)                1.000      0.000  1

Table 3 Estimated marginal probabilities of the regimes and within Gaussian parameters

              P(Z)                  Return (mean)         Risk (variance)
              Regime 1  Regime 2    Regime 1  Regime 2    Regime 1  Regime 2
Estimate        0.2545    0.7455     -0.1025    0.0596      7.3521    0.8802
Std. error    (0.0280)  (0.0280)    (0.0276)  (0.0060)    (0.1439)  (0.0116)

Table 4 Characterization of the switching regimes

              Cluster 1             Cluster 2
              Regime 1  Regime 2    Regime 1  Regime 2
P(Z|W)          0.3487    0.6513      0.1601    0.8399
              (0.0141)  (0.0141)    (0.0135)  (0.0135)
Transitions
  Regime 1      0.9047    0.0953      0.9349    0.0651
              (0.0068)  (0.0068)    (0.0063)  (0.0063)
  Regime 2      0.0512    0.9488      0.0124    0.9876
              (0.0035)  (0.0035)    (0.0012)  (0.0012)

Standard errors in parentheses. Transition rows give the regime at time t - 1, columns the regime at time t.
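The posterior class membership probabilities in Table 2 follow from Bayes' rule: the prior cluster probability is multiplied by the likelihood of a market's whole return series under that cluster and then normalized. The sketch below shows this step on the log scale, which is needed in practice because series-level likelihoods are extremely small. The priors are those of Table 2; the two log-likelihood values are hypothetical, chosen only to illustrate how a modest log-likelihood gap produces a posterior close to one.

```python
import math


def posterior_cluster_probs(priors, log_lik_by_cluster):
    """P(W = w | y_i) from priors and per-cluster log-likelihoods (log-sum-exp)."""
    joint = [math.log(p) + ll for p, ll in zip(priors, log_lik_by_cluster)]
    top = max(joint)
    log_norm = top + math.log(sum(math.exp(j - top) for j in joint))
    return [math.exp(j - log_norm) for j in joint]


# Priors from Table 2; the log-likelihoods are made-up illustrative numbers.
posterior = posterior_cluster_probs([0.501, 0.499], [-5860.0, -5855.2])
modal_cluster = 1 + posterior.index(max(posterior))
print(posterior, modal_cluster)
```

With thousands of daily observations per market, log-likelihood gaps between clusters are typically far larger than the 4.8 used here, which is consistent with posterior probabilities that are one or very close to one.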

Table 3 provides information on the two regimes that were identified; that is, the
average proportion of markets in regime $k$ over time and the mean and variance of
the returns in regime $k$. The result is in line with the common dichotomization of
financial markets into “bull” and “bear” markets. Consistently, the reported means
show that one of the regimes is associated with positive returns (bull market) and the
other with negative returns (bear market). The probability of being in the bear and
bull regimes is 0.25 and 0.75, respectively. We would also like to emphasize that
these results are coherent with the commonly acknowledged volatility asymmetry
of financial markets: volatility is likely to be higher when markets fall than when
markets rise. Here, the variance of returns in the bear regime (7.35) is much larger
than in the bull regime (0.88).
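The claim that a mixture of Gaussian regimes can reproduce skewed, heavy-tailed returns can be checked directly from the Table 3 estimates. The sketch below computes the moments of the two-component Gaussian mixture implied by the regime probabilities, means and variances. Treating the pooled marginal distribution as this simple mixture is our simplification; the function name is illustrative.

```python
def mixture_moments(weights, means, variances):
    """Mean, variance, skewness and excess kurtosis of a Gaussian mixture."""
    mu = sum(w * m for w, m in zip(weights, means))
    var = m3 = m4 = 0.0
    for w, m, v in zip(weights, means, variances):
        d = m - mu  # shift of the component mean from the overall mean
        var += w * (v + d ** 2)
        m3 += w * (d ** 3 + 3.0 * d * v)
        m4 += w * (d ** 4 + 6.0 * d ** 2 * v + 3.0 * v ** 2)
    return mu, var, m3 / var ** 1.5, m4 / var ** 2 - 3.0


# Table 3 estimates: regime 1 (bear) and regime 2 (bull).
mix_mean, mix_var, mix_skew, mix_kurt = mixture_moments(
    [0.2545, 0.7455], [-0.1025, 0.0596], [7.3521, 0.8802]
)
print(mix_mean, mix_var, mix_skew, mix_kurt)
```

Although each regime is Gaussian, the implied mixture has slightly negative skewness and clearly positive excess kurtosis, because the less frequent bear regime combines a lower mean with a much larger variance.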

Table 4 reports the estimated probabilities of being in one of the regimes within
each cluster. There is a clear distinction between these clusters. Cluster 1 has the
larger probability of being in the bear regime (0.35); for cluster 2 this probability
is 0.16. Moreover, Table 4 provides another key insight from our analysis:
it gives the transition probabilities between the two regimes for both clusters. First,
notice that both clusters show regime persistence. Once a stock market jumps to a
regime, it is likely to remain within the same regime for a while, which is coherent
with stylized facts in financial markets. Second, cluster 2 shows a lower propensity
to move from a bull regime to a bear regime (0.012) than cluster 1 (0.051). Third,
cluster 1 shows a higher probability of jumping from a bear to a bull regime (0.095) than
cluster 2 (0.065). This is in line with the idea that cluster 1 has more emerging markets,
which are known for having more and longer financial crises than developed markets.
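Two quantities help in reading the transition probabilities of Table 4: the long-run (stationary) share of time spent in each regime and the expected length of a stay in a regime. The sketch below derives both for a two-state chain, using the standard results that the stationary bear probability is p21 / (p12 + p21) and that the expected duration of a regime with stay probability p is 1 / (1 - p). Regime 1 is read as the bear regime, following the negative mean in Table 3; the function names are illustrative.

```python
def stationary_two_state(p12, p21):
    """Stationary probabilities (state 1, state 2) of a two-state Markov chain."""
    total = p12 + p21
    return p21 / total, p12 / total


def expected_duration(p_stay):
    """Expected number of consecutive periods spent in a state (geometric law)."""
    return 1.0 / (1.0 - p_stay)


# Table 4 estimates: p12 = bear -> bull, p21 = bull -> bear.
stationary_c1 = stationary_two_state(0.0953, 0.0512)
stationary_c2 = stationary_two_state(0.0651, 0.0124)

bear_days_c1 = expected_duration(0.9047)
bear_days_c2 = expected_duration(0.9349)
bull_days_c1 = expected_duration(0.9488)
bull_days_c2 = expected_duration(0.9876)
print(stationary_c1, stationary_c2, bear_days_c1, bear_days_c2, bull_days_c1, bull_days_c2)
```

The stationary bear probabilities (about 0.35 and 0.16) agree closely with the P(Z|W) row of Table 4. The expected bull spell is roughly 20 trading days in cluster 1 against roughly 80 in cluster 2, so the estimates suggest that cluster 1 markets differ mainly in how often they enter the bear regime.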

Figure 2 shows the regime-switching dynamics of the countries within both clusters.
It depicts the posterior probability of being in the bull regime at period $t$, where
the grey color identifies periods in which this probability is below 0.5, which
corresponds to a higher likelihood of being in the bear regime. A long period
of “bear regimes” is visible that starts at the end of 1997 with Thailand’s currency
crisis, lasts until 2002, and affects all the countries of the region. However, the
behavior before 1997 and after 2002 is clearly different between countries from clusters 1
and 2. The two clusters of countries have rather different patterns of regime switching.
Cluster 2 is more regime persistent, with short-duration bear regimes that did not
turn out to be endemic during the period of analysis, despite critical periods around
1998. Cluster 1 is extremely dynamic and tends to move very fast between regimes,
switching frequently between bear and bull states.

Fig. 2 Estimated posterior bull regime probability and modal regime (panel a: Cluster 1)



5 Conclusions 



A mixture of hidden Markov models allows model-based clustering of financial time
series. In the analysis of a sample of 12 stock markets observed over a
period of 3,454 days, the best fitting model was the one with two clusters. The two
clusters clearly define two distinct types of regime switching, which is coherent
with many stylized facts in finance. Moreover, the simultaneous analysis of the 12
time series allows a better comparison of country dynamics than
Markov-switching approaches that estimate regimes for each country
separately (see, e.g., Wang & Theobald, 2008).



Multivariate Comparative Analysis of Stock
Exchanges: The European Perspective



Julia Koralun-Bereznicka 



Abstract The aim of the research is to perform a multivariate comparative analysis
of 20 European stock exchanges in order to identify the main similarities between
the objects. Due to the convergence process of capital markets in Europe, the
similarities between stock exchanges could be expected to increase over time. The research
is meant to show whether and how these similarities change. Consequently, the
distances between clusters of similar stock exchanges should become less significant,
which the analysis also aims at verifying. The basis of comparison is a set of 48
monthly variables from the period January 2003 to December 2006. The variables
are classified into three categories: size of the market, equity trading and bonds. The
paper aims at identifying the clusters of similar stock exchanges and at finding the
characteristic features of each of the distinguished groups. The obtained
categorization to some extent corresponds with the division of the European Union into “new”
and “old” member countries. The clustering method, performed for each quarter
separately, also reveals that the classification is fairly stable in time. The factor analysis,
which was carried out to reduce the number of variables, reveals three major factors
behind the data, which are related to the earlier mentioned categories of variables.

Keywords Cluster analysis • Factor analysis • Stock exchanges. 



1 Introduction 

The convergence of financial markets has been researched for many decades. Most
studies, mainly because of the advances of the integration process, including the
introduction of the common currency, tend to focus on the European area (Hasan &
Schmiedel, 2004; Kim, Moshirian, & Wu, 2005; Pascual, 2003; Rockinger, 2000). Assuming
that the progress of capital market integration is nowadays unquestionable
(Kearney & Lucey, 2004), the main aim of the paper is to analyse the effects of
this process through the identification of similarities between selected European stock
exchanges. This is done with the use of several multivariate statistical methods,
including cluster analysis and k-means grouping. The methodology of the research
is to some extent determined by the number of both objects and diagnostic
variables. The stock exchanges are described by a number of variables characterizing
many different aspects of the analysed capital markets. The paper aims at identifying
the clusters of similar stock exchanges and at finding the characteristic features of
each of the distinguished groups. Another purpose is to simplify the data structure
in order to better observe the regularities concerning the analysed stock markets. The
reduction of the number of variables describing the stock markets was attempted
with the use of factor analysis.



2 Data Description 

The subject of the analysis is a set of 20 European stock exchanges:
Athens Exchange (ATH), Borsa Italiana (ITA), Bratislava Stock Exchange (BRA),
Budapest Stock Exchange (BUD), Cyprus Stock Exchange (CYP), Deutsche Börse
(DEU), Euronext (ENXT), Iceland Stock Exchange (ICE), Irish Stock Exchange
(IRE), Ljubljana Stock Exchange (LJU), London Stock Exchange (LON),
Luxembourg Stock Exchange (LUX), Malta Stock Exchange (MAL), OMX, Oslo
Børs (OSL), Prague Stock Exchange (PRA), Spanish Exchanges (BME),
Warsaw Stock Exchange (WAR), SWX Swiss Exchange (SWX) and Wiener Börse
(VIE). The objects were compared with the use of 48 monthly variables concerning
the period from January 2003 to December 2006. They were categorized into
three groups: variables describing the size of the market, $S_j$ (Table 1), variables
connected with equity trading, $E_j$ (Table 2), and variables concerning bonds, $B_j$
(Table 3). The source of the data is the Federation of European Securities Exchanges
(http://www.fese.be). All of the 48 variables are considered stimulants, i.e., variables
for which higher values are desirable, and are characterized by a coefficient of
variation significantly higher than 10%.



Table 1 Variables characterizing size of stock exchanges 



Market capitalization 




No. of new 


Investment flows 


No. of companies 






companies listed 


channeled through the 
exchange (EUROm) 


with listed shares 


Value at month end 




Domestic ^4 


Newly issued S(, 


Domestic 5g 


(EUROm) 










% Change MoM in 


52 


Domestic 5s 


Newly issued Si 


Domestic 5g 


EURO 










On previous year end in 
EURO 


5a 












Table 2 Variables characterizing equity markets 



Equity trading 




Domestic Foreign 


Value, year to date 


Electronic order book 


Trades 


El 


El 




En 




Transactions 


Turnover 


E2 


Eg 




Eli 




Negotiated deals 


Trades 


Ei 


Eg 




Eh 






Turnover 


E, 


Elo 




En 




% Change MoM in EURO 




Es 


En 








% Turnover velocity 




Ei 










Table 3 Variables characterizing bond trading 


Specification 




Domestic 


Domestic 


International 


Total, year 






public 


non public 






to date 






sector 


sector 








Electronic order 


Trades 


Bi 


Be 


Bn 




Bie 


Book transactions 


Turnover 


Bi 


Bi 


Bn 




Bn 


Negotiated deals 


Trades 


Bi 


Bg 


Bn 




Big 




Turnover 


B4 


Bg 


Bu 




Big 


% Change MoMin EURO 




Bs 


Bio 


Bn 






Listed bonds 




Bio 


Bn 


Bn 






New listed in the month 






Bn 








Money raised (EURO m) 






Bn 









3 Cluster Analysis 

One of the classification methods which enables distinguishing internally
homogeneous groups of objects is agglomerative cluster analysis, which at the same time
is an effective way of simplifying large groups (Wishart, 1999). The first step of
the cluster analysis is establishing a three-dimensional matrix of observations,
consisting of flat tables for each period examined. The rows of each table represent
the objects, i.e., stock exchanges, whereas the columns represent the variables analysed.
Before clustering can be performed, the data have to be made comparable, as the variables
are not expressed in the same units. This was done by the (0,1) standardization
formula:

$$z_{ij} = \frac{x_{ij} - \min_i \{x_{ij}\}}{\max_i \{x_{ij}\} - \min_i \{x_{ij}\}},$$

where $i$ is the number of the object and $j$ is the number of the variable.
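The (0,1) standardization above can be sketched as a column-wise rescaling of an objects-by-variables table. The function name and the tiny data table are illustrative; mapping a constant column to zero is our own convention for the degenerate case, which the text does not discuss.

```python
def min_max_scale(table):
    """Rescale every column of a row-major table to the interval [0, 1]."""
    n_vars = len(table[0])
    lows = [min(row[j] for row in table) for j in range(n_vars)]
    highs = [max(row[j] for row in table) for j in range(n_vars)]
    return [
        [
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_vars)
        ]
        for row in table
    ]


# Three objects (rows) described by two variables on very different scales.
scaled = min_max_scale([[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]])
print(scaled)
```

After rescaling, the object with the smallest value of a variable gets 0 and the one with the largest gets 1, so variables measured in different units contribute comparably to distances.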

The distances between the objects in the multivariate space were measured with
the most commonly used Euclidean distance, whereas Ward’s method was chosen
as the most suitable for linking clusters (Boillat, de Skowronsky, & Tuchschmid,
2002). The cluster analysis was performed on the averages of the variables over the
48-month period. The results of the grouping algorithm are shown in Fig. 1.

Fig. 1 Tree diagram, average monthly variables from 2003 to 2006, Euclidean distance, Ward’s
method (axis label: Linkage distance)
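The agglomeration step can be illustrated with a compact sketch of Ward's criterion: at each step the two clusters are merged whose fusion increases the within-cluster sum of squares the least, which for clusters of sizes n_a and n_b with centroids c_a and c_b equals n_a n_b / (n_a + n_b) times the squared Euclidean distance between the centroids. This is a plain illustrative implementation on made-up points, not the software used in the paper, and linkage distances reported by statistical packages may be scaled differently.

```python
def ward_clusters(points, k):
    """Agglomerate points with Ward's criterion until k clusters remain."""
    k = max(1, k)
    # Each cluster is (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares caused by the merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        merged = (ma + mb, [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)


# Five made-up objects in two dimensions forming two obvious groups.
demo_points = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.3], [5.0, 5.0], [5.4, 5.0]]
two_groups = ward_clusters(demo_points, 2)
print(two_groups)
```

Stopping the merging at two clusters corresponds to cutting the tree diagram at its longest branches.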



Cutting the branches of the tree where they are the longest forms two clearly
distinguished homogeneous clusters. The first one consists of five stock exchanges
(Italy, Euronext, Germany, Madrid and London), which are relatively better
developed in all three analytical areas, whereas the other cluster comprises all the
15 remaining objects. When interpreting the length of the linkages in the above graph,
it can also be said that within the first category of stock markets the elements
differ from each other more than in the second cluster. The linkages are visibly
longer, e.g., for London and Madrid (LON, BME) than for Bratislava and
Ljubljana (BRA, LJU). This shows that the more numerous cluster of less developed
stock exchanges is characterized by higher homogeneity. Moreover, the fact that
the stock exchanges of all the countries that recently joined the EU are in the same
cluster makes it clear that the distance between the old and new members is still
significant. This is also evident from the linkage distance between the two clusters.

In order to find possible changes in the structure of clusters, the above agglom- 
eration procedure was also performed for each quarter of the analysed period 
separately. The clustering method for each of the periods again leads to distin- 
guishing two clearly separated groups. The content of each of them is shown in 
Table 4. 

The structure of the identified clusters shows that their content is quite stable and
largely independent of the analytical period. Most elements of both clusters
are placed in the same group throughout and therefore form a sort of core of the
cluster. The only exceptions are Luxembourg, Oslo and OMX, which change clusters
between periods (Table 4). It is also worth mentioning that the identified clusters
largely coincide with the still persisting and in a way natural division of our continent
into the old and new member countries of the European Union. Although the obtained
clusters do not fully represent the groups of old and new members, as one could expect,
the stock exchanges of all of the countries which joined the EU later are in the same
cluster. This regularity can be observed in all periods considered.

Table 4 Cluster analysis results for quarterly data from 2003 to 2006

Period   Elements of the cluster
         I                                        II
2003Q1   ITA, ENXT, DEU, LON, BME                 ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OMX, OSL, PRA, WAR, SWX, VIE
2003Q2   ITA, ENXT, DEU, LON, BME                 ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OMX, OSL, PRA, WAR, SWX, VIE
2003Q3   ITA, ENXT, DEU, LON, BME, LUX            ATH, BRA, BUD, CYP, ICE, IRE, LJU, MAL, OMX, OSL, PRA, WAR, SWX, VIE
2003Q4   ITA, ENXT, DEU, LON, BME, LUX, OMX       ATH, BRA, BUD, CYP, ICE, IRE, LJU, MAL, OSL, PRA, WAR, SWX, VIE
2004Q1   ITA, ENXT, DEU, LON, BME                 ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OMX, OSL, PRA, WAR, SWX, VIE
2004Q2   ITA, ENXT, DEU, LON, BME                 ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OMX, OSL, PRA, WAR, SWX, VIE
2004Q3   ITA, ENXT, DEU, LON, BME                 ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OMX, OSL, PRA, WAR, SWX, VIE
2004Q4   ITA, ENXT, DEU, LON, BME                 ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OMX, OSL, PRA, WAR, SWX, VIE
2005Q1   ITA, ENXT, DEU, LON, BME, OMX            ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OSL, PRA, WAR, SWX, VIE
2005Q2   ITA, ENXT, DEU, LON, BME, LUX, OMX       ATH, BRA, BUD, CYP, ICE, IRE, LJU, MAL, OSL, PRA, WAR, SWX, VIE
2005Q3   ITA, ENXT, DEU, LON, BME, LUX, OMX       ATH, BRA, BUD, CYP, ICE, IRE, LJU, MAL, OSL, PRA, WAR, SWX, VIE
2005Q4   ITA, ENXT, DEU, LON, BME, OMX            ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OSL, PRA, WAR, SWX, VIE
2006Q1   ITA, ENXT, DEU, LON, BME, OMX            ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OSL, PRA, WAR, SWX, VIE
2006Q2   ITA, ENXT, DEU, LON, BME, OMX            ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OSL, PRA, WAR, SWX, VIE
2006Q3   ITA, ENXT, DEU, LON, BME, OMX, LUX, OSL  ATH, BRA, BUD, CYP, ICE, IRE, LJU, MAL, PRA, WAR, SWX, VIE
2006Q4   ITA, ENXT, DEU, LON, BME, OMX            ATH, BRA, BUD, CYP, ICE, IRE, LJU, LUX, MAL, OSL, PRA, WAR, SWX, VIE



4 K-means Grouping 

The analysis so far has led to the identification of two groups of objects. This in
turn is a starting point for another classification method, i.e., k-means grouping,
where it is necessary to declare the number of clusters in advance. The aim of the
algorithm is to create k clusters that are as distinct as possible; they are formed by
moving objects between the groups in order to minimise the within-group variability and
maximise the between-group variability (Wishart, 2001).

Fig. 2 Average of variables for each cluster (axis label: Variables)



This algorithm, when first carried out on the variables averaged over the whole
analytical period, led to the formation of two clusters, the first of which contained five
objects (ITA, DEU, ENXT, LON, BME), i.e., the core of one of the two clusters identified
with the previous agglomeration method. In order to find the characteristic
features of the formed clusters, the average values of the variables can be analysed for
each of them; these are presented in Fig. 2.
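The reallocation idea behind k-means grouping can be sketched with the classical alternating scheme: assign every object to its nearest cluster centre, recompute the centres as group means, and repeat until the assignment no longer changes. This is a generic illustration on made-up one-dimensional data with user-supplied starting centres; it is not the specific variant or software used in the paper.

```python
def k_means(points, centers, max_iter=100):
    """Alternate nearest-centre assignment and centre updates until stable."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(
                range(len(centers)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres will not move either
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Five made-up objects and k = 2 starting centres.
demo_labels, demo_centers = k_means(
    [[0.0], [0.2], [0.1], [1.0], [0.9]], [[0.0], [1.0]]
)
print(demo_labels, demo_centers)
```

Because the number of clusters must be fixed in advance, taking k = 2 from the preceding agglomerative analysis is a natural choice; the resulting cluster centres play the same role as the cluster averages shown in Fig. 2.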

From the comparison of the averages it can be concluded that most of the variables
have high discriminatory power, as they differ significantly between the
groups. Moreover, the figure shows that the stock exchanges from the first cluster
are much better developed than their counterparts from the other group. Most
variables are clearly higher for the first cluster. This can be observed in the case of
variables concerning the size of the markets (except for variables $S_2$ and $S_3$), as
well as variables describing trading, especially equity trading. In the case of the equity
market variables the two clusters are most distinct. However, also when considering bond
markets, the first cluster is characterized by generally better development parameters
(the only exceptions being variables $B_5$ and $B_{10}$, which are higher in the second
cluster).



5 Factor Analysis

A natural procedure when dealing with a relatively large number of variables, some of
which are strongly correlated, is to reduce their number in order to clarify the whole
picture. One method of simplifying the structure of variables is factor analysis,
which was applied in this case. The goal of factor analysis, as a method of
quantitative multivariate analysis, is to represent the interrelationships among a set
of continuously measured variables by a number of underlying, linearly independent
reference variables called factors. Factor analysis is performed by examining
the pattern of correlations (or covariances) between the observed variables. Variables
that are highly correlated (either positively or negatively) are likely to
be influenced by the same factors, while those that are relatively uncorrelated are
likely to be influenced by different factors (Krzanowski, 1988; Morrison, 1967). In order
to obtain a reasonable number of factors, a stopping rule has to be adopted. As the
Kaiser criterion, which states that all factors with an eigenvalue above 1 are
significant, would leave us with too many (8) factors that are difficult to interpret,
the scree plot test was applied. As a result, three factors were distinguished as
meaningful. The correlation between each of the selected factors and the original
variables is shown in Table 5. Variables that are not significantly correlated were
removed to improve legibility.
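
The stopping rules mentioned above can be illustrated with a minimal sketch. It assumes that the eigenvalues are taken from the correlation matrix of the variables; the data are synthetic, and the function names (`correlation_eigenvalues`, `kaiser_count`, `explained_share`) are hypothetical and not taken from the study.

```python
import numpy as np

def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix, sorted in descending order."""
    corr = np.corrcoef(data, rowvar=False)   # columns are variables
    return np.linalg.eigvalsh(corr)[::-1]    # eigvalsh returns ascending order

def kaiser_count(eigenvalues):
    """Number of factors retained by the Kaiser criterion (eigenvalue > 1)."""
    return int(np.sum(eigenvalues > 1.0))

def explained_share(eigenvalues, n_factors):
    """Share of total variance carried by the first n_factors eigenvalues."""
    return float(eigenvalues[:n_factors].sum() / eigenvalues.sum())

# Synthetic example (assumption): 8 observed variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 8))
data = latent @ loadings + 0.3 * rng.normal(size=(200, 8))

eigenvalues = correlation_eigenvalues(data)
print("eigenvalues (scree plot values):", np.round(eigenvalues, 3))
print("factors kept by Kaiser criterion:", kaiser_count(eigenvalues))
print("share explained by first 3 factors:", round(explained_share(eigenvalues, 3), 3))
```

Plotting the sorted eigenvalues against their rank gives the scree plot; the number of factors is then read off at the point where the curve levels out, which may well be fewer than the Kaiser criterion would retain.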

Table 5 allows an interpretation of the factors, which together
explain almost 68% of the variability. The first factor is highly correlated with the
largest number of variables, most of which are linked to the equity markets.
It could therefore be named the "equity factor". Factor 2 is mainly connected
with the bond variables, so a natural interpretation here would be "bonds factor".



Table 5 Factor loadings (> 0,7), Varimax normalized

                  Factor 1              Factor 2              Factor 3
                  Variable   Load       Variable   Load       Variable   Load
                  El         0,831      El         0,974      54         0,85536
                  El         0,723      E9         0,963      Se         0,86904
                  Es         0,916      Ei4        0,978      Si         0,73871
                  E(.        0,784      Bs         0,771      Ss         0,89628
                  El         0,881      Bn         0,919      E4         0,78838
                  Eia        0,773      Bn         0,910      En         0,72911
                  En         0,739      Bio        0,734      Bi         0,72592
                  Ell        0,847      Bn         0,777      B4         0,95269
                  En         0,732      Bn         0,927      Bio        0,92564
                  Bi         0,799
                  Be         0,719
                  Bi4        0,913
                  Bi6        0,928
Variance expl.               11,493                10,494                10,50987
Contribution                 0,239                 0,219                 0,21896

Source: author’s own calculations based on http://www.fese.be 

The last factor could be defined as a "general economic situation factor" as it is
mainly influenced by size variables, but also by equity and bonds domestic trading.



6 Summary and Conclusions 

The main conclusion drawn from the multivariate comparison of the analysed stock
exchanges is the heterogeneity of the examined population. The diversity of the
objects can be seen in all three aspects involved in the analysis, i.e., in the size
of the markets, in the development of equity markets, and in that of bond markets.
Considering the identified similarities between the objects throughout the whole period,
two categories of stock markets can be distinguished. The group of highly developed
stock exchanges includes London SE, Deutsche Börse, Euronext, Borsa Italiana
and Spanish Exchanges, which is confirmed by both agglomerative
clustering and k-means grouping.

Factor analysis, which was meant to discover the structure of the data and to 
reduce the initially high number of variables, revealed three principal components, 
which to some extent correspond with the previous data categorisation. 

Moreover, the analyses show that the dissimilarities between European stock
markets are still significant. This is shown, for example, by the fairly stable
composition of the clusters obtained with both agglomerative clustering and k-means
grouping in each quarter of the examined period. Both methods revealed a clear
division into two distinguishable clusters. One of them usually contained the German,
Italian, British and Spanish stock exchanges, as well as Euronext. These objects are
mainly characterised by considerably better average parameters describing their
development, in terms of the size of the markets as well as equity and bond trading.
The remaining stock exchanges, by contrast, are usually smaller and have lower ratio
values. However, they seem to be less diversified within their group than
the better developed cluster. The identified discrepancies confirm that the distance
between the stock markets of old EU member countries and the relatively new ones
persists.

Summing up, it can be concluded that capital market integration is a slow
process. It can be expected that further integration would enhance the development
of the international financial sector (Pagano, 1993). However, it can also make
international portfolio diversification less attractive as a result of the gradual
equalisation of returns across countries (Kearney & Lucey, 2004).


Empirical Examination of Fundamental
Indexation in the German Market



Max Mihm and Hermann Locarek-Junge 



Abstract Index Funds, Exchange Traded Funds and Derivatives give investors easy 
access to well diversified index portfolios. These index-based investment products 
exhibit low fees, which make them an attractive alternative to actively managed 
funds. Against this background, a new class of stock indices has been established 
based on the concept of “Fundamental Indexation”. The selection and weighting of 
index constituents is conducted by means of fundamental criteria like total assets, 
book value or number of employees. This paper examines the performance of fun- 
damental indices in the German equity market. For this purpose, a backtest of five 
fundamental indices is conducted over the last 20 years. Furthermore the index 
returns are analysed under the assumption of an efficient as well as an inefficient 
market. Index returns in efficient markets are explained by applying the three factor 
model for stock returns of Fama and French (J Financ Econ 33(1):3-56, 1993). The 
results show that the outperformance of fundamental indices is partly due to a higher 
risk exposure, particularly to companies with a low price to book ratio. By relaxing 
the assumption of market efficiency, a return drag of capitalisation weighted indices 
can be deduced. Given a mean-reverting movement of prices, a direct connection 
between market capitalisation and index weighting leads to inferior returns. 

Keywords Fundamental indexation • Market efficiency • Passive investments



1 Introduction 

Traditional stock market indices weight companies by means of market capital- 
isation, mostly corrected by a free float factor. A low index turnover and good 
investability qualify these indices as suitable underlyings for index replicating funds 
or derivatives. Furthermore cap weighted indices are essential benchmarks since 
they reflect the average return of a certain stock market. Theoretical legitimation 
is provided through the Capital Asset Pricing Model (CAPM), which can lead to
the conclusion that the market cap index portfolio is the only efficient composition
of risky assets.¹ So there are several reasons for the dominance of traditional stock
indices during the last decades.

However, a new index approach named Fundamental Indexation has triggered a
lively and controversial debate within the fund and indexing industry. The
constituents of fundamental indices are selected and weighted according to fundamental
factors such as revenues or total assets, to mention just a few. Arnott, Hsu & Moore
(2005) show that these indices outperform traditional cap weighted indices by
1.66-2.56 percentage points annually over a 43-year period. While critics of fundamental
indices say the return advantage is delivered through higher systematic risks,
advocates explain the superior performance by inefficient markets and a return drag of
traditional index approaches.

In this paper five different fundamental indices that differ in terms of their
selection and weighting criteria are calculated and compared against a cap weighted and
an equal weighted index in the German market over the last 20 years. The empirical
results are presented in Sect. 3. The cap weighted index represents the traditional
approach to constructing a stock index, while the equal weighted index shows how
fundamental indices differ from a naive investment strategy. Furthermore, the
empirical data is used to explain performance differences. In Sect. 4 the analysis is
conducted under the assumption of an efficient as well as an inefficient market.



2 Data and Index Methodology 

The data set consists of all German companies registered in the Thomson Financial
Database and covers the period 1 January 1988 to 23 July 2007. Where a company has
more than one share class, the one with the largest capitalisation is included in the
index universe. Index rebalancing is performed annually, and only those companies with
a track record of at least 1 year are taken into account.

All indices encompass 100 companies in total and are constructed as total return
indices. The base value as at 1 January 1988 is 100. In addition to the traditional
market cap weighted index (MK100) there is one equal weighted index (GG100) as
well as five fundamental indices based on the criteria revenues (UM100), number
of employees (MA100), total assets (GK100), book value of equity (EK100) and
dividend payment (DV100). According to these criteria the index portfolio is
rebalanced annually. Table 1 indicates the index methodology by showing the selection
and weighting criteria of each index variant. All indices except the cap weighted
one weight their constituents independently of the market, that is, there
is no direct link between price and weight. For the GK100 index, the index universe
excludes financials and insurance companies. This correction is made to foster
sector diversification, because the index criterion is strongly sector dependent.
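
The selection and weighting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fundamental_weights`, `equal_weights`), the toy tickers and the rule of ignoring non-positive criterion values are all hypothetical.

```python
def fundamental_weights(criterion, n_constituents=100):
    """Select the largest companies by a fundamental criterion and weight
    them in proportion to that criterion (price plays no role)."""
    # Assumption: companies with a non-positive criterion value are not eligible.
    eligible = {name: value for name, value in criterion.items() if value > 0}
    selected = sorted(eligible, key=eligible.get, reverse=True)[:n_constituents]
    total = sum(eligible[name] for name in selected)
    return {name: eligible[name] / total for name in selected}

def equal_weights(market_caps, n_constituents=100):
    """Select by market capitalisation, then give every constituent the same weight."""
    selected = sorted(market_caps, key=market_caps.get, reverse=True)[:n_constituents]
    return {name: 1.0 / len(selected) for name in selected}

# Toy universe (hypothetical numbers), index size reduced to 3 for readability.
revenues = {"A": 50.0, "B": 30.0, "C": 20.0, "D": 5.0}
market_caps = {"A": 10.0, "B": 40.0, "C": 25.0, "D": 30.0}

revenue_index = fundamental_weights(revenues, n_constituents=3)
equal_index = equal_weights(market_caps, n_constituents=3)
print(revenue_index)
print(equal_index)
```

In the toy universe the revenue-based index holds A, B and C, whereas the equal weighted index, which selects by market capitalisation, holds B, D and C. Repeating the computation at each annual rebalancing date would give the weight history of an index.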



¹ The assumptions that are necessary to draw this conclusion are not discussed here.

Table 1 Index methodology

Index    Selection criteria      Weighting criteria
MK100    Market capitalisation   Market capitalisation
GG100    Market capitalisation   Equal weighted
UM100    Revenues                Revenues
MA100    Number of employees     Number of employees
GK100    Total assets            Total assets
EK100    Book value              Book value
DV100    Dividend payment        Dividend payment

Table 2 Index performance

Index   Index value   Geom. mean   Geom. mean return   Standard        Sharpe   Jensen   Treynor
        23/07/2007    return (%)   after costs (%)     deviation (%)   ratio    alpha    ratio
MK100     732.01      10.08         9.84               17.45           0.415    -        0.0724
GG100     930.05      11.61        10.89               12.06           0.658    3.06     0.1239
UM100   1,310.88      13.30        12.82               17.11           0.602    3.37     0.1089
MA100   1,319.72      13.30        12.79               17.53           0.587    3.36     0.1070
GK100   1,505.98      14.13        13.63               17.32           0.632    4.16     0.1161
EK100   1,134.56      12.58        12.09               17.83           0.543    2.29     0.0962
DV100   1,116.62      12.52        11.71               16.14           0.567    2.71     0.1019


3 Results 

The index return statistics summarised in Table 2 show that the cap weighted index
approach exhibits the lowest historical returns. Fundamental indices realise
geometric mean returns that exceed the market return (MK100) by 2.44-4.05 percentage
points annually. As at 23 July 2007 the value of a portfolio that rebalances index
constituents by means of total assets (GK100) instead of market capitalisation is
more than twice as high, before rebalancing costs are taken into account.

The rebalancing costs mainly consist of transaction costs, but also include holding
costs, information costs and tax inefficiencies. The higher cost efficiency of the MK100
index partially offsets its return disadvantage. As Table 2 shows, the superior
returns of the equal weighted and fundamental indices persist after taking costs into
account.²

With returns that exceed the market cap weighted index returns at
similar standard deviations, Modern Portfolio Theory suggests a dominance
of fundamental indices. All performance measures indicate that fundamental
indices exhibit a superior performance over the last 20 years. The same applies to
the equal weighted index, owing to its low standard deviation.
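
The three performance measures reported in Table 2 can be sketched as follows. The sketch uses the textbook definitions (Sharpe ratio as mean excess return per unit of standard deviation, Treynor ratio as mean excess return per unit of beta, Jensen alpha as the CAPM intercept); whether the authors annualised or estimated them in exactly this way is an assumption. The return series are synthetic and the function name is hypothetical.

```python
import numpy as np

def performance_measures(index_returns, market_returns, risk_free):
    """Sharpe ratio, Treynor ratio and Jensen alpha of an index relative to a
    market benchmark, for a constant per-period risk-free rate."""
    index_excess = np.asarray(index_returns, dtype=float) - risk_free
    market_excess = np.asarray(market_returns, dtype=float) - risk_free
    beta = np.cov(index_excess, market_excess, ddof=1)[0, 1] / np.var(market_excess, ddof=1)
    return {
        "beta": float(beta),
        "sharpe": float(index_excess.mean() / index_excess.std(ddof=1)),
        "treynor": float(index_excess.mean() / beta),
        "jensen_alpha": float(index_excess.mean() - beta * market_excess.mean()),
    }

# Synthetic weekly returns (assumption): the index loads on the market and
# earns a small extra return of its own.
rng = np.random.default_rng(1)
market = rng.normal(0.002, 0.02, size=500)
index = 0.0005 + 0.9 * market + rng.normal(0.0, 0.005, size=500)

print(performance_measures(index, market, risk_free=0.0005))
```

For the benchmark itself beta is one and the Jensen alpha is zero, so its Treynor ratio equals its mean excess return. The MK100 row of Table 2 is consistent with this: 0.415 × 17.45% ≈ 7.24%, which matches the reported Treynor ratio of 0.0724.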

In order to assess the significance of higher returns, the rolling return difference
of each index to the cap weighted index (MK100) is analysed. Table 3 summarises



² Index replication costs are calculated based on index turnover, assuming a cost rate of 2%.

Table 3 Return differences to MK100

Panel A: Time horizon 1 year, 19 observations
Index   Mean (ppt)   Standard deviation (ppt)   t-value   Positive outperformance (%)
GG100   0.71         8.27                       0.36      52.63
UM100   3.06**       7.16                       2.41      73.68
MA100   3.07**       7.07                       2.68      68.42
GK100   3.72**       7.63                       2.81      73.68
EK100   2.44***      3.23                       4.48      84.21
DV100   1.92*        6.86                       1.56      68.42

Panel B: Time horizon 1 week, 1017 observations
Index   Mean (ppt)   Standard deviation (ppt)   t-value   Positive outperformance (%)
GG100   0.008        1.069                      0.25      49.75
UM100   0.056***     0.636                      2.82      56.15
MA100   0.058***     0.701                      2.66      54.18
GK100   0.071***     0.756                      2.98      56.05
EK100   0.044***     0.430                      3.30      55.95
DV100   0.037**      0.584                      2.04      55.75

*, ** or *** indicate that the null hypothesis of an arithmetic mean of zero is rejected with a 10%,
5% or 1% level of significance, respectively



the return discrepancies for different time horizons. The share of superior returns
is stated in the last column, and the t-value measures the significance
of the superior returns of each index. The share of superior returns of the
fundamental indices depends strongly on the time horizon: for longer time
horizons the share increases markedly. As Panel B shows, fundamental indices exhibit
superior returns in comparison to the MK100 index in 54-56% of the cases, assuming
a holding period of one week. If the holding period is extended to 1 year, the
chance of beating the market with a fundamental index increases to as much as 84%
for the EK100 index. Because fewer observations are available for long holding
periods, the t-values decrease for most fundamental indices. This time horizon effect
does not apply to the equal weighted GG100 index. The t-values do not suggest a
significant return difference between the GG100 and the MK100 index for any holding
period that we observed.
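
The test statistics in Table 3 can be approximated with a standard one-sample t-statistic for the null hypothesis of a zero mean return difference. The sketch below is an illustration under that assumption and is not claimed to be the authors' exact procedure; the function names are hypothetical, and the inputs are the rounded UM100 entries of Panel B.

```python
import math

def t_value(mean_diff, std_diff, n_obs):
    """One-sample t-statistic for H0: mean return difference equals zero."""
    return mean_diff / (std_diff / math.sqrt(n_obs))

def outperformance_share(differences):
    """Percentage of periods in which the return difference is positive."""
    return 100.0 * sum(d > 0 for d in differences) / len(differences)

# Rounded Panel B entries for the UM100 index (weekly horizon, 1017 observations).
t_um100 = t_value(mean_diff=0.056, std_diff=0.636, n_obs=1017)
print(round(t_um100, 2))

# Hypothetical return differences in percentage points, for illustration only.
print(outperformance_share([0.4, -0.2, 0.1, 0.3]))
```

With the rounded table entries the statistic comes out close to the reported value of 2.82; the small gap is what one would expect from rounding of the mean and standard deviation.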

The authors think that the time horizon effect may be due to the long term mean 
reversion of prices towards fair value. The price independent weighting mechanism 
benefits from such a price process.³ Interestingly, the equal weighted index is not 
capable of capturing any return advantage from noisy prices. Arnott, Hsu, and West 
(2008) state that the return difference is due to the fact that the equal weighted 
index is simply reweighting the companies that are selected by market capitali- 
sation. The difference should disappear if the fundamental index constituents are 
equally weighted. Further research is necessary to identify the main drivers of the 
time horizon influence on return differences. 



³ See Sect. 4.2 for further explanation.

The different performance characteristics are certainly caused by different index
compositions. Fundamental indices are characterised by a tilt towards companies
with low valuation levels,⁴ while the equal weighted index puts emphasis on
companies that are small in terms of market capitalisation. These strategic tilts
account for most of the return difference between the fundamental and equal weighted
indices and the average stock market, represented by the capitalisation weighted index.

The calculated fundamental indices feature superior performance characteristics
in the German stock market of the last 20 years. Thus the empirical results
confirm the findings of former studies on fundamental indexation.⁵ Past performance
does not allow any conclusions about future developments. However, if returns are
explained through a reasonable model, predictions can be made under the assumptions
of that model.



4 Analysis 

The index returns are analysed in two ways. Firstly, an efficient market is presumed.
Under this assumption the three factor model based on Fama and French (1993)
is used to explain returns based on the risk exposure of each index. The second
approach relaxes the assumption of an efficient market and implies a mean reverting
price process. Such a price process can explain return discrepancies of the calculated
index methodologies. The authors consider both perspectives since neither of them can
be ruled out. The fair value of a company is unknown and can only be approximated
by models (like the one which is used in Sect. 4.1). Every test of market efficiency
implies a joint hypothesis which does not allow unambiguous conclusions.⁶



4.1 Efficient Market 



A multiple regression based on the three factor model shows the risk exposure of
the index portfolios. The model takes three systematic risks into account. Besides
the market excess return $(r_t^m - r_t^f)$, which is also incorporated in the CAPM, there
are an $SMB_t$ (small minus big) and an $HML_t$ (high minus low) factor. They refer
to the higher risk of small firms and of companies with low valuation levels,
respectively. Equation (1) gives the regression equation that is used to calculate the
risk exposures for each index $i$:

$$r_t^i - r_t^f = \alpha^i + \beta_M^i \,(r_t^m - r_t^f) + \beta_{SMB}^i \, SMB_t + \beta_{HML}^i \, HML_t + \varepsilon_t^i. \qquad (1)$$
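
The regression in Eq. (1) can be estimated by ordinary least squares. The following sketch is an illustration on synthetic factor data, not the authors' estimation code; the function name `three_factor_regression`, the simulated factor series and the chosen "true" parameters are assumptions made for the example.

```python
import numpy as np

def three_factor_regression(index_excess, market_excess, smb, hml):
    """OLS estimate of alpha and the three factor betas of Eq. (1)."""
    y = np.asarray(index_excess, dtype=float)
    X = np.column_stack([np.ones_like(y), market_excess, smb, hml])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r_squared = 1.0 - residuals.var() / y.var()   # valid because X has an intercept
    return {
        "alpha": float(coef[0]),
        "beta_m": float(coef[1]),
        "beta_smb": float(coef[2]),
        "beta_hml": float(coef[3]),
        "r_squared": float(r_squared),
    }

# Synthetic weekly data (assumption): 1017 observations, known parameters.
rng = np.random.default_rng(2)
n_obs = 1017
market_excess = rng.normal(0.001, 0.02, size=n_obs)
smb = rng.normal(0.0, 0.01, size=n_obs)
hml = rng.normal(0.0005, 0.01, size=n_obs)
index_excess = (0.0003 + 0.95 * market_excess + 0.05 * smb + 0.20 * hml
                + rng.normal(0.0, 0.002, size=n_obs))

estimates = three_factor_regression(index_excess, market_excess, smb, hml)
print({name: round(value, 4) for name, value in estimates.items()})
```

Multiplying each estimated beta by the mean of its factor gives the part of the mean excess return attributed to that risk factor; the remainder is the regression alpha.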



⁴ Valuation levels are measured by the revenues to market cap ratio and the book value of equity to
market cap ratio.

⁵ See Lowry (2007), Arnott, Hsu & Moore (2005) or Wood & Evans (2003) for the US market. For
studies on other markets compare Hemminki & Puttonen (2008).

⁶ Besides the market efficiency, it is always the pricing model that is tested.

Table 4 Regression Parameters

i        α^i        β^i_M       β^i_SMB      β^i_HML     R²
MK100    0.0056%    0.980***    -0.046***    -0.008      0.985
GG100    0.0248     0.773***     0.227***     0.099***   0.878
UM100    0.0299     0.924***    -0.002        0.251***   0.940
MA100    0.0268     0.988***     0.069**      0.225***   0.925
GK100    0.0497**   0.954***     0.044        0.187***   0.900
EK100    0.0236*    0.937***    -0.089***     0.206***   0.971
DV100    0.0295*    0.853***    -0.050**      0.195***   0.944

*, ** or *** indicate that the null hypothesis of a regression parameter of zero is rejected with a
10%, 5% or 1% level of significance, respectively



Fig. 1 Performance attribution



The regression parameters are summarised in Table 4. The equal weighted index
has a high risk exposure to the SMB factor due to its bias towards small cap
companies. Fundamental indices exhibit relatively high risk exposures to companies with
low valuation levels, expressed by the HML factor beta.

The impact of risk exposure on index returns is illustrated in Fig. 1. Due to a
tiny SMB risk premium in the German stock market, the SMB risk factor barely
contributes to index returns. While the MK100 index returns are almost completely
based on market risk, the fundamental indices benefit from their exposure to value
companies expressed through the HML factor.

However, a significant part of the superior returns of fundamental indices is not
explained by systematic risk factors. There is a high regression alpha which
contributes to the outperformance of fundamental indices without being based on any
kind of systematic risk. Importantly, within the model assumptions this proportion
of the return cannot be considered persistent, since in an efficient market there
is no sustainable return without systematic risk. This conclusion is based on the
presumptions that the market is efficient and that the model incorporates all risk
factors and hence calculates fair values.


4.2 Inefficient Market

In inefficient markets prices differ from fair values. The likelihood that market
participants identify irrational prices increases with the extent of mispricing.
Consequently, prices revert to their fair value in inefficient markets, which leads to
a mean reverting price process or so-called noisy prices.⁷ So there are times of
overpricing and times of underpricing.

The price of an asset, as well as its return, can be broken down into a value
component and a mean reverting component. The value component of a price is what one
would assume to prevail in an efficient market. The model used in Sect. 4.1 gives
an approximation for the value component of the return of a company, based on its
systematic risk factors.

Given the fact that prices are mean reverting, Treynor (2005) shows that a return 
drag of market cap weighted indices can be deduced. Certainly there is no way to 
definitely identify over- and undervalued companies. However, there is a system- 
atic failure in the capitalisation weighted index methodology. Since the weighting 
is dependent on market valuation, a positive mispricing will automatically lead to 
a higher weight in the index. Consequently, compared to market independently 
weighted indices, traditional market cap weighted indices relatively overweight 
overpriced companies and vice versa relatively underweight underpriced compa- 
nies. Regardless of differences in exposure to systematic risks, whenever prices 
are mean reverting, the market cap weighted index is expected to exhibit inferior 
returns in comparison to fundamental or equal weighted indices. This return differ- 
ence is independent from the value component of returns, which is determined by 
systematic risks. 
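
The return drag argument can be illustrated with a small simulation. This is a toy model under strong assumptions and not the derivation given by Treynor (2005): every stock has the same constant fair value of 1, and the log mispricing of each stock follows an independent AR(1) process, so prices revert towards fair value. Under these assumptions a fair-value weighted portfolio coincides with an equal weighted one. All names and parameter values are hypothetical.

```python
import numpy as np

def simulate_return_drag(n_stocks=50, n_periods=2000, phi=0.9, sigma=0.05, seed=0):
    """Mean per-period return of a cap weighted and an equal weighted
    portfolio when prices are fair value times mean reverting noise."""
    rng = np.random.default_rng(seed)
    # Log mispricing starts from the stationary distribution of the AR(1) process.
    mispricing = rng.normal(scale=sigma / np.sqrt(1.0 - phi**2), size=n_stocks)
    cap_returns, equal_returns = [], []
    for _ in range(n_periods):
        prices = np.exp(mispricing)                  # fair value is 1 for every stock
        next_mispricing = phi * mispricing + sigma * rng.normal(size=n_stocks)
        stock_returns = np.exp(next_mispricing - mispricing) - 1.0
        cap_weights = prices / prices.sum()          # overpriced stocks get more weight
        cap_returns.append(float(cap_weights @ stock_returns))
        equal_returns.append(float(stock_returns.mean()))
        mispricing = next_mispricing
    return float(np.mean(cap_returns)), float(np.mean(equal_returns))

cap_mean, equal_mean = simulate_return_drag()
print("cap weighted mean return:  ", round(cap_mean, 5))
print("equal weighted mean return:", round(equal_mean, 5))
print("return drag:               ", round(equal_mean - cap_mean, 5))
```

Because the cap weights rise with the current mispricing while the expected return falls with it, the cap weighted portfolio earns less on average than the price-independent one in this setting, although no systematic risk factor is present in the model.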

Under the assumption of an inefficient market with mean reverting prices, the
return advantage of fundamental indices that is not based on systematic risk
factors can be considered persistent. Assuming that the market is inefficient and the
model used in Sect. 4.1 is correct, the return advantage expressed by the regression
alpha may be explained through the return drag of the MK100 index. This
conclusion differs fundamentally from the implications in an efficient market, where the
regression alpha was considered a non-sustainable part of the index returns.



5 Conclusion 

The empirical results as well as the conclusions of the analysis indicate that fun- 
damental indices are contributing to the universe of passive investment products. 
Without being actively managed, fundamental indices capture a value premium 



⁷ For a valuable discussion about the origins and implications of noise see Black (1986).
through placing emphasis on companies with low valuation levels. The regression
shows that this characteristic significantly accounts for the superior performance of
fundamental indices.

Apart from the risk exposure to value companies, there is an additional return
advantage for fundamental indices when prices are mean reverting. This advantage
is due to the fact that fundamental indices cut the link between market valuation
and index weighting. As a result, overpriced companies are not systematically
overweighted and underpriced companies are not systematically underweighted. It
follows that fundamental indices deliver a real added value, since investors do not
have to bear any additional risk to justify their extra return.

Advocates of the efficient market hypothesis may argue that the applied three
factor model does not incorporate all relevant risk factors and thus calculates false
returns. However, the authors favour the explanation from Sect. 4.2, which rests on the
notion that markets are inefficient and prices are characterised by mean reversion
towards fair value. We do this for two reasons.

First, the assumption of noisy prices contributes to solving the size and value puzzle
by explaining these market anomalies through mean reversion of prices. Arnott &
Hsu (2008) develop a model that shows that the value and size effects are driven
by noisy prices. This explanation seems to be more realistic, bearing in mind that
the interpretation of a higher systematic risk of small cap and value companies is
doubtful.⁸

Second, there are plenty of empirical examples as well as theoretical arguments that
support the notion of irrational prices. The tech bubble that burst in 2000 is the most
prominent historical example of exuberance in stock markets. Byrne & Brooks
(2008) summarise current findings in the field of behavioural finance that provide
us with multiple reasons why prices deviate from fair value.

Therefore we believe that a more realistic description of stock returns may be 
obtained by incorporating noise and mean reversion in the price process rather 
than assuming a random walk. To what extent the return advantage of fundamen- 
tal indices can be explained by noisy prices has to be researched in greater detail. 
Fama & French (2007) develop an approach to split stock returns in three compo- 
nents based on the development of dividends, book value and valuation level. They 
find that the price to book ratio’s convergence over time is responsible for return 
differences of growth and value stocks. The finding that valuation levels are mean 
reverting over time is one indication that there is a more realistic way to explain 
well known market anomalies and by doing this explain return differences between 
fundamental indices and traditional market cap weighted indices. More importantly 
this explanation of stock returns may lead to a new way of portfolio construction, 
without weighting companies by means of market capitalisation. 



⁸ For a discussion on the lack of empirical evidence for a higher systematic risk of value companies
see for example Chan & Lakonishok (2004).


The Analysis of Power for Some Chosen VaR
Backtesting Procedures: Simulation Approach



Krzysztof Piontek 



Abstract Everyone who measures the market risk using the Value at Risk (VaR)
approach should test if the assumed model is correct. This procedure is called
backtesting. There are many different tests available, but usually risk managers are not
concerned about their power.

The aim of this paper is to analyze some chosen backtesting methods focusing 
on the problem of power of the tests and limited data sets. 

The paper is organized as follows. At the beginning a financial aspect of the ana- 
lyzed problem is presented very briefly. The second part gives information about 
some chosen, but (in the author’s opinion) the most popular backtests. The main 
attention is paid to tests based on the frequency of failures and on multiple VaR 
levels. Next, the results of the simulations are presented. The last part summarizes 
obtained results and gives hints for the optimal backtesting. 

Keywords Backtesting • Power of tests ■ Risk measurement ■ Value at risk. 



1 Introduction 



Value at Risk is one of the most popular risk measures used by financial institutions.
The definition of VaR is, however, quite general (Jorion, 2001; Piontek, 2007):

Value at Risk is such a loss in market value of a portfolio that the probability that it occurs
or is exceeded over a given time period is equal to a prior defined tolerance level q.

The sense of this definition is presented below:

$$P\left(W \le W_0 - VaR(q)\right) = q, \qquad (1)$$

where $W_0$ is the present value of a financial instrument (portfolio), $W$ is its value
at the end of the investment horizon (a random variable) and $q$ is a predefined
tolerance level.

Put differently: let $q$ be a VaR tolerance level, $N$ the number of days in a given
investment horizon (called the VaR forecast horizon) and $x = 100(1 - q)$; then we are
$x$% sure that we will lose no more than VaR (in monetary terms) during the
next $N$ days.

Usually, econometric models are not expressed in terms of values but in terms of
returns:

$$P\left(r_t \le F_{r,t}^{-1}(q)\right) = q, \qquad VaR_{r,t}(q) = -F_{r,t}^{-1}(q), \qquad (2)$$

where $r_t$ is a rate of return (periodic or logarithmic) and $F_{r,t}^{-1}(q)$ is a quantile
of the loss distribution related to the probability of $1 - q$ (Piontek, 2007).

However, this definition does not inform in what way a VaR measure should 
actually be estimated. Because of it, there are many approaches which can give 
different values. 

The most popular methodologies for calculating VaR are (Jorion, 2001): the 
historical simulation, stochastic (Monte Carlo) simulation, variance-covariance 
approach, group of methods using the quantile of non-normal distribution, extreme- 
value-theory approach. 
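
Two of the listed approaches can be sketched directly from the return-based definition in (2), where VaR is the negated q-quantile of the return distribution. The sketch below is only an illustration: the function names are hypothetical, the variance-covariance variant assumes normally distributed returns, and the return series is simulated.

```python
import numpy as np
from statistics import NormalDist

def historical_var(returns, q):
    """Historical simulation: negated empirical q-quantile of past returns."""
    return -float(np.quantile(np.asarray(returns, dtype=float), q))

def normal_var(mean, std, q):
    """Variance-covariance approach under a normal return distribution."""
    return -(mean + std * NormalDist().inv_cdf(q))

# Simulated daily returns (assumption): zero mean, 1% daily volatility.
rng = np.random.default_rng(3)
returns = rng.normal(0.0, 0.01, size=100_000)

q = 0.05
var_normal = normal_var(0.0, 0.01, q)
var_hist = historical_var(returns, q)
print("normal VaR:    ", round(var_normal, 5))
print("historical VaR:", round(var_hist, 5))
```

On normally distributed data the two estimates nearly coincide. On real return series they can differ noticeably, which is one reason why competing VaR models have to be backtested.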

The risk managers, however, never know a priori, which approach or model 
will be the best or even correct, and should use several models and then backtest 
them. The validation of risk models, during the backtesting procedure, should be 
the critical issue in the acceptance of internal models: 

Backtesting is an ex-post comparison of a risk measure generated by a risk model against 
actual changes in portfolio value over a given period of time, both to evaluate a new model 
and to reassess the accuracy of existing models (Jorion, 2001). 

In the case of value at risk this means that a series of VaR forecasts made for
subsequent days is compared with an empirical time series of returns (spanning the same
period). Most backtesting procedures analyze the instances in which the empirical
loss exceeds the corresponding VaR estimate. The methods considered analyze the frequency
(test statistics with an index uc), the time-dependence/independence (test statistics with
an index ind) or both (test statistics with an index mix) of such exceedances.

Nowadays, in the author’s opinion, the challenge is not to suggest a new method 
of VaR measuring but to distinguish between correct and incorrect models. Although 
the issue is important, no single backtesting technique has been established until 
now. 

The most popular tests for validation of VaR models can be classified into three 
groups (Piontek, 2007; Hass, 2001; da Silva, da Silveira Barbedo, Araujo, & das 
Neves, 2005): 

1. Those based on the frequency of failures (Kupiec, Christoffersen) (Jorion, 2001; 
Kupiec, 1995), 

2. Those based on the adherence of a VaR model to the asset return distribution (on 
multiple VaR levels) (Crnkovic-Drachman, Berkowitz) (Berkowitz, 2000), 

3. Those based on various loss functions (Lopez, Sarma) (Lopez, 1998). 

In the further part of this paper, the empirical examination is carried out only for
some methods from the first and second group. If one is interested in an automation
of backtesting, the third approach is probably the least suited and not as popular
as the others. For this reason this group is not analyzed in the empirical part of this work.



2 Tests Based on the Frequency of Failures 



The most popular tool for validation of VaR models (for a backtesting
period of T units) is the failure (or hit) process $[I_t(q)]_{t=1}^{T}$. The hit
function is defined as follows (Kupiec, 1995; Jorion, 2001):

$$I_t(q) = \begin{cases} 1, & r_t \le F_{r,t}^{-1}(q) \quad \text{if a violation occurs} \\ 0, & r_t > F_{r,t}^{-1}(q) \quad \text{if no violation occurs} \end{cases} \qquad (3)$$

and tallies the history of whether or not the exceptions have been realized.

Kupiec proposed the proportion-of-failures test, which is probably the most
common one. Here it is examined how many times a financial institution's VaR is violated
over a given span of time (Kupiec, 1995; Jorion, 2001; Hass, 2001; da Silva et al.,
2005; Piontek, 2007). This test analyzes the unconditional coverage (uc) property of
the hit sequence. The null hypothesis for this test is that the empirically determined
probability matches the given tolerance level of VaR:

$$H_0: \hat{q} = q. \qquad (4)$$



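The test statistic is based on the likelihood ratio given by formula (5), and it is
asymptotically chi-square distributed with one degree of freedom:

$$LR_{uc} = -2 \ln\left( \frac{(1-q)^{T_0}\, q^{T_1}}{(1-\hat{q})^{T_0}\, \hat{q}^{T_1}} \right) \sim \chi^2_1, \tag{5}$$

where:

$$\hat{q} = \frac{T_1}{T_0 + T_1}, \qquad T_1 = \sum_{t=1}^{T} I_t(q), \qquad T_0 = T - T_1. \tag{6}$$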
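Formulas (5) and (6) translate almost directly into code. The following is a minimal Python sketch, not the author's implementation; the function names `kupiec_lr_uc` and `chi2_sf_1df` and the example hit sequence are assumptions made for illustration only.

```python
import math

def kupiec_lr_uc(hits, q):
    """Kupiec proportion-of-failures statistic LR_uc, following (5)-(6).

    hits: sequence of 0/1 values of the hit function I_t(q)
    q:    VaR tolerance level under the null hypothesis
    """
    T = len(hits)
    T1 = sum(hits)          # number of violations
    T0 = T - T1             # number of non-violations
    q_hat = T1 / T          # empirical failure frequency

    def loglik(p):
        # Binomial log-likelihood; a term with a zero count is skipped,
        # i.e. 0 * log(0) is treated as 0.
        ll = 0.0
        if T0 > 0:
            ll += T0 * math.log(1.0 - p)
        if T1 > 0:
            ll += T1 * math.log(p)
        return ll

    return -2.0 * (loglik(q) - loglik(q_hat))

def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with 1 d.o.f."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical example: 250 observations with 6 violations, q = 0.01.
hits = [1] * 6 + [0] * 244
lr = kupiec_lr_uc(hits, 0.01)
p_value = chi2_sf_1df(lr)
reject = lr > 3.841   # 5% critical value of chi-square with 1 d.o.f.
```

In this hypothetical example the failure frequency is 2.4 times the tolerance level, yet the statistic stays below the 5% critical value, which anticipates the low-power problem discussed later in the paper.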



If the VaR model is reliable, the exceptions should not follow any pattern, such
as clustering (Jorion, 2001; Campbell, 2005; Hass, 2001).

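The most popular method for examining the independence of exceptions (ind) is
the Christoffersen test (Jorion, 2001; Piontek, 2007), given by the formula:

$$LR_{ind} = -2 \ln\left( \frac{(1-\hat{q})^{T_{00}+T_{10}}\, \hat{q}^{\,T_{01}+T_{11}}}{(1-\hat{q}_{01})^{T_{00}}\, \hat{q}_{01}^{\,T_{01}}\, (1-\hat{q}_{11})^{T_{10}}\, \hat{q}_{11}^{\,T_{11}}} \right) \sim \chi^2_1, \tag{7}$$

where:

$$\hat{q}_{ij} = \frac{T_{ij}}{T_{i0} + T_{i1}}, \qquad \hat{q} = \frac{T_{01} + T_{11}}{T_{00} + T_{01} + T_{10} + T_{11}}, \tag{8}$$

and $T_{ij}$ is the number of i values followed by a j value in the hit series.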
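The transition counts and the likelihood ratio of the Christoffersen test can be computed as in the Python sketch below. It is an illustration under the standard first-order Markov formulation of the test; the function names and the two example hit sequences are hypothetical.

```python
import math

def transition_counts(hits):
    """Count T_ij: the number of i values followed by a j value."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, curr in zip(hits[:-1], hits[1:]):
        counts[(prev, curr)] += 1
    return counts

def _xlogy(x, y):
    # x * log(y) with the convention 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(y)

def christoffersen_lr_ind(hits):
    """Likelihood ratio statistic for independence of exceptions."""
    c = transition_counts(hits)
    t00, t01, t10, t11 = c[(0, 0)], c[(0, 1)], c[(1, 0)], c[(1, 1)]
    q_hat = (t01 + t11) / (t00 + t01 + t10 + t11)
    q01 = t01 / (t00 + t01) if (t00 + t01) > 0 else 0.0
    q11 = t11 / (t10 + t11) if (t10 + t11) > 0 else 0.0
    # Likelihood under independence (one common failure probability)
    ll_null = _xlogy(t00 + t10, 1 - q_hat) + _xlogy(t01 + t11, q_hat)
    # Likelihood under the first-order Markov alternative
    ll_alt = (_xlogy(t00, 1 - q01) + _xlogy(t01, q01)
              + _xlogy(t10, 1 - q11) + _xlogy(t11, q11))
    return -2.0 * (ll_null - ll_alt)

# Two hypothetical hit series with the same number of violations (10 of 100):
clustered = [0] * 90 + [1] * 10          # all violations in one cluster
spread = ([0] * 9 + [1]) * 10            # violations evenly spaced
lr_clustered = christoffersen_lr_ind(clustered)
lr_spread = christoffersen_lr_ind(spread)
```

Both series have the same failure frequency, so a frequency-based test cannot tell them apart; only the independence statistic reacts to the clustering.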

This test examines whether or not the likelihood of a VaR violation depends on
whether or not a VaR violation occurred on the previous day:

$$H_0: q_{01} = q_{11} = q. \tag{9}$$

Although Christoffersen criticizes the first-order Markov process as a limited
alternative compared to other forms of clustering (Piontek, 2007), the presented approach
is easy to implement and, because of this, it is still the most popular one. It is also
independent of the frequency of failures.

It is important to recognize that the unconditional coverage and independence
properties of the hit sequence are separate and distinct, and both must be satisfied by a
correct VaR model (Jorion, 2001). The test that jointly examines both properties
was proposed by Christoffersen and is usually called the mixed (mix) test
(Jorion, 2001; Campbell, 2005; Piontek, 2007):

$$LR_{mix} = LR_{uc} + LR_{ind} \sim \chi^2_2. \tag{10}$$

Some discussions of joint tests might seem to suggest that joint tests are universally
preferable to tests of either the unconditional coverage property or the independence
property alone. But this is not true. The fact that one property is satisfied makes it
more difficult for a joint test to detect the other inadequacy of the VaR measure.

3 Tests Based on Multiple VaR Levels

All the backtests that have been discussed in Sect. 2 have focused on examining
the behaviour of the hit sequence. Despite the fact that the hit function plays a
prominent role in a variety of backtesting procedures, the information contained in
the hit sequence is limited.

There is no need to restrict attention to a single VaR level. The unconditional
coverage and independence properties of a correct VaR model should hold for any
tolerance level q, so backtest procedures based on multiple VaR levels have
also been suggested (Berkowitz, 2000; da Silva et al., 2005). They examine the deviation
of the empirical return distribution from the theoretical model distribution (Piontek,
2007). Usually (Berkowitz, 2000), the observed portfolio returns $r_t$ are transformed
into a series $u_t$ [see (11)], where F denotes the ex ante forecasted return distribution
function (conditional or not):

$$u_t = F(r_t) = \int_{-\infty}^{r_t} f(y)\,dy. \tag{11}$$

If the model is well calibrated it is expected that:

• The series $u_t$ should be uniformly distributed over the unit interval [0,1] (this
property is a direct parallel to the unconditional coverage).

• The series should be independently distributed (this is analogous to the statement
that VaR violations should be independent of each other).

These two properties are combined into the single statement: $u_t \sim \text{i.i.d. } U(0, 1)$.

A wide variety of tests using these conditions have been suggested. Some of
them are based on the distance between the observed distribution of the $u_t$ series and the
theoretical uniform distribution (Piontek, 2007; Hass, 2001). However, an increasingly
popular approach is the transformation of the $u_t$ series into a $z_t$
series based on the inverse normal distribution function $\Phi^{-1}(\cdot)$:

$$z_t = \Phi^{-1}(u_t). \tag{12}$$

Under the null hypothesis that the VaR model is correct, the zt series should be 
independent and identically distributed standard normal random values. 
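The two transformations (11) and (12) can be sketched in a few lines of Python. The normal forecast distribution used in the example is purely an assumption for illustration (the simulations in this paper use Student distributions); `pit_transform` is a hypothetical helper name.

```python
from statistics import NormalDist

def pit_transform(returns, forecast_cdf):
    """Apply u_t = F(r_t) as in (11), then z_t = Phi^{-1}(u_t) as in (12)."""
    std_normal = NormalDist()                 # standard normal, mean 0, sd 1
    u = [forecast_cdf(r) for r in returns]
    z = [std_normal.inv_cdf(x) for x in u]    # requires 0 < u_t < 1
    return u, z

# Hypothetical forecast: returns are N(0, 0.01^2) on every day.
forecast = NormalDist(mu=0.0, sigma=0.01)
u, z = pit_transform([-0.02, 0.0, 0.01], forecast.cdf)
```

When the forecast distribution is itself normal, $z_t$ is simply the standardized return, which makes the sketch easy to check by hand.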

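Now it is easy to construct quite powerful likelihood ratio tests (Berkowitz, 2000;
da Silva et al., 2005):

$$z_t - \mu = \rho\,(z_{t-1} - \mu) + \varepsilon_t, \qquad \operatorname{var}(\varepsilon_t) = \sigma^2, \tag{13}$$

$$H_0: (\mu, \sigma, \rho) = (0, 1, 0). \tag{14}$$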

A restricted likelihood can be evaluated and compared to an unrestricted one
in order to test the analyzed properties. The properties of unconditional coverage and
independence may be tested separately or jointly. Under the null hypotheses these test
statistics are chi-square distributed with the corresponding number of degrees
of freedom:

$$LR_{uc} = 2\left[ LLF(\hat{\mu}, \hat{\sigma}, \hat{\rho}) - LLF(0, 1, \hat{\rho}) \right] \sim \chi^2_2, \tag{15}$$

$$LR_{ind} = 2\left[ LLF(\hat{\mu}, \hat{\sigma}, \hat{\rho}) - LLF(\hat{\mu}, \hat{\sigma}, 0) \right] \sim \chi^2_1, \tag{16}$$

$$LR_{mix} = 2\left[ LLF(\hat{\mu}, \hat{\sigma}, \hat{\rho}) - LLF(0, 1, 0) \right] \sim \chi^2_3. \tag{17}$$
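The statistics (15)-(17) can be sketched in Python as follows. This is not the author's code. Two simplifying assumptions are made and should be kept in mind: the Gaussian AR(1) likelihood of (13) is conditioned on the first observation, and the restricted likelihoods are evaluated exactly as written in (15) and (16), i.e. with the unrestricted estimates plugged in for the parameters that remain free. All function names are hypothetical.

```python
import math
import random

def ar1_loglik(z, mu, sigma, rho):
    """Gaussian AR(1) log-likelihood of model (13), conditional on z[0]."""
    ll = 0.0
    for prev, curr in zip(z[:-1], z[1:]):
        resid = (curr - mu) - rho * (prev - mu)
        ll += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - resid ** 2 / (2.0 * sigma ** 2))
    return ll

def fit_ar1(z):
    """Conditional ML estimates (mu, sigma, rho) via least squares."""
    x, y = z[:-1], z[1:]
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    rho = sxy / sxx
    intercept = my - rho * mx            # equals mu * (1 - rho)
    mu = intercept / (1.0 - rho)
    rss = sum((b - intercept - rho * a) ** 2 for a, b in zip(x, y))
    sigma = math.sqrt(rss / n)
    return mu, sigma, rho

def berkowitz_lr(z):
    """Return (LR_uc, LR_ind, LR_mix) following (15), (16) and (17)."""
    mu, sigma, rho = fit_ar1(z)
    ll_full = ar1_loglik(z, mu, sigma, rho)
    lr_uc = 2.0 * (ll_full - ar1_loglik(z, 0.0, 1.0, rho))
    lr_ind = 2.0 * (ll_full - ar1_loglik(z, mu, sigma, 0.0))
    lr_mix = 2.0 * (ll_full - ar1_loglik(z, 0.0, 1.0, 0.0))
    return lr_uc, lr_ind, lr_mix

# Hypothetical data: a well calibrated series and a shifted (biased) one.
rng = random.Random(1)
z_good = [rng.gauss(0.0, 1.0) for _ in range(500)]
z_bad = [v + 1.0 for v in z_good]
stats_good = berkowitz_lr(z_good)
stats_bad = berkowitz_lr(z_bad)
```

The 5% critical values for one, two and three degrees of freedom are 3.841, 5.991 and 7.815, respectively. A stricter implementation would re-maximize each restricted likelihood over its free parameters instead of plugging in the unrestricted estimates.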



Backtesting Errors 

The tests presented above are usually used for evaluating internal VaR models
developed by financial institutions. One should, however, be aware of the fact that two
types of errors can occur: a correct model can be rejected, or a wrong one may
not be rejected (Jorion, 2001).

All tests are designed to control the probability of rejecting the VaR model
when the model is correct. This means that the type I error is known. This type of
wrong decision leads to the necessity of searching for another model, which is just
a waste of time and money. But the type II error (acceptance of an incorrect model) is
a severe misjudgement, because it can result in the use of an inadequate VaR model
that can lead to substantial negative consequences.

The performance of the selected tests needs to be analyzed with regard to the type II
error, in order to select the best one for different (but small) numbers of observations
and model misspecifications.



4 Empirical Research: Simulation Approach 

For evaluating the power of the tests it is necessary that the properties of the asset 
return generation process are well known. This data generating process can differ 
from the probability distribution of the assumed VaR model. 

We assume that the generated returns follow standardized Student distribution 
with the number of degrees of freedom between about 3 and 25. The returns can be 
independent of each other or not. 

The VaR model is also based on the standardized Student distribution, but the 
number of degrees of freedom is equal to 6. So, it can be incorrect, which leads to 
the unconditional coverage and independence inaccuracy. On this ground we can 
test the power of backtests. 

Data series with lengths of 100, 250, 500, 750 and 1,000 observations were
simulated. For each kind of model inaccuracy and for each of the specified
series lengths, Monte Carlo simulations with 10,000 draws were performed. This
made it possible to calculate the test statistics and to estimate the frequency at which
the null hypotheses were rejected for incorrect models. The latter may be treated as
an approximation of the test power.
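The Monte Carlo procedure described above can be illustrated with a short Python sketch for the Kupiec test. It deliberately simplifies the design of the paper: instead of simulating Student-distributed returns and comparing them with the VaR forecast, it draws the number of violations directly from a Bernoulli process with a given true failure probability. The function names, the default of 2,000 draws and the seed are assumptions for illustration.

```python
import math
import random

def kupiec_lr(T, T1, q):
    """LR_uc of (5) computed from T1 violations in T observations."""
    def ll(p):
        out = 0.0
        if T1 < T:
            out += (T - T1) * math.log(1.0 - p)
        if T1 > 0:
            out += T1 * math.log(p)
        return out
    return -2.0 * (ll(q) - ll(T1 / T))

def rejection_rate(T, q, p_true, n_draws=2000, cv=3.841, seed=0):
    """Share of simulated hit series for which H0: q_hat = q is rejected.

    p_true is the actual probability of a violation; p_true == q gives an
    estimate of the type I error, p_true != q an estimate of the power.
    """
    rng = random.Random(seed)
    rejections = 0
    for _draw in range(n_draws):
        T1 = sum(1 for _obs in range(T) if rng.random() < p_true)
        if kupiec_lr(T, T1, q) > cv:
            rejections += 1
    return rejections / n_draws

size = rejection_rate(250, 0.05, 0.05)    # model is correct
power = rejection_rate(250, 0.05, 0.07)   # model understates risk
```

Replacing `cv` with an empirical quantile of the simulated statistics under the null would correspond to the "CV based on simulations" variant used in the tables.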

Tables 1 and 2 present the summary results of the Kupiec approach. The central 
column (for 0.05) represents the type I error, other columns - the power of the test 
for given strength of inaccuracy (incorrect frequency of failures). 

Table 1 Kupiec test, q = 0.05, $\alpha$ = 0.05, CV based on $\chi^2_1$

Number     Frequency of failures
of obs.    0.030    0.035    0.040    0.045    0.050    0.055    0.060    0.065    0.070
100        0.1955   0.1339   0.0940   0.0719   0.0653   0.0725   0.0927   0.1249   0.1680
250        0.3751   0.2263   0.1279   0.0744   0.0585   0.0757   0.1242   0.2015   0.3018
500        0.6656   0.4180   0.2164   0.0975   0.0539   0.0736   0.1534   0.2876   0.4554
750        0.8068   0.5321   0.2629   0.1016   0.0537   0.1021   0.2420   0.4476   0.6600
1,000      0.9142   0.6743   0.3512   0.1269   0.0514   0.1015   0.2711   0.5182   0.7493

Table 2 Kupiec test, q = 0.05, $\alpha$ = 0.05, CV based on simulations

Number     Frequency of failures
of obs.    0.030    0.035    0.040    0.045    0.050    0.055    0.060    0.065    0.070
100        0.1948   0.1320   0.0894   0.0626   0.0486   0.0456   0.0528   0.0699   0.0968
250        0.3751   0.2259   0.1263   0.0695   0.0462   0.0512   0.0829   0.1414   0.2247
500        0.5681   0.3238   0.1519   0.0635   0.0395   0.0685   0.1519   0.2872   0.4553
750        0.8068   0.5321   0.2627   0.1000   0.0458   0.0789   0.1982   0.3902   0.6054
1,000      0.8838   0.6114   0.2920   0.0967   0.0419   0.0995   0.2708   0.5182   0.7493

Table 3 Kupiec test, q = 0.01, $\alpha$ = 0.05, CV based on $\chi^2_1$

Number     Frequency of failures
of obs.    0.006    0.007    0.008    0.009    0.010    0.011    0.012    0.013    0.014
100        0.0032   0.0055   0.0087   0.0130   0.0184   0.0250   0.0328   0.0420   0.0525
250        0.2230   0.1748   0.1386   0.1124   0.0948   0.0847   0.0815   0.0845   0.0934
500        0.1993   0.1382   0.0986   0.0769   0.0709   0.0788   0.0996   0.1321   0.1751
750        0.1730   0.1053   0.0647   0.0445   0.0408   0.0523   0.0787   0.1198   0.1749
1,000      0.2844   0.1729   0.1023   0.0650   0.0551   0.0696   0.1073   0.1667   0.2446

Table 4 Kupiec test, q = 0.01, $\alpha$ = 0.05, CV based on simulations

Number     Frequency of failures
of obs.    0.006    0.007    0.008    0.009    0.010    0.011    0.012    0.013    0.014
100        0.0032   0.0055   0.0087   0.0130   0.0184   0.0250   0.0328   0.0420   0.0525
250        0.0009   0.0021   0.0043   0.0081   0.0137   0.0217   0.0326   0.0466   0.0639
500        0.0496   0.0308   0.0207   0.0173   0.0198   0.0285   0.0440   0.0670   0.0979
750        0.1730   0.1053   0.0647   0.0445   0.0408   0.0523   0.0787   0.1198   0.1749
1,000      0.2843   0.1724   0.1002   0.0593   0.0425   0.0461   0.0692   0.1117   0.1730

When we use the chi-square distribution to determine the critical value (CV)
for a typical series length, we can be wrong for two reasons: the assumption
of asymptotic convergence of the test statistic is not met, and the test statistic is
not continuous but discrete. This has to be kept in mind. However, as has been checked,
it did not make a big difference whether chi-square or simulated critical values were used
for the tolerance level of 5%.

What we see is that the power of the test is rather low. For example, in the case 
with 250 observations, an inaccurate model giving 3% or 7% of violations, instead 
of 5%, was rejected only in about 35% of draws. So, in 65% of cases we did not 
reject the wrong model at 5% significance level. 

For the tolerance level of 1% the results are presented in Tables 3 and 4. It turns
out that relying on the asymptotic property of the test statistic might be a serious problem. We
obtain some atypical and unexpected values of the type I error and incorrect results for
the power of the test (e.g., for 250 observations).
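The role of discreteness can be made explicit by computing the rejection probability of the Kupiec test exactly, since the number of violations in T independent observations is binomially distributed. The Python sketch below is an illustration under that independence assumption; `exact_rejection_prob` is a hypothetical helper, and the critical value 3.841 is the 5% quantile of the chi-square distribution with one degree of freedom.

```python
import math

def kupiec_lr(T, T1, q):
    """LR_uc of (5) computed from T1 violations in T observations."""
    def ll(p):
        out = 0.0
        if T1 < T:
            out += (T - T1) * math.log(1.0 - p)
        if T1 > 0:
            out += T1 * math.log(p)
        return out
    return -2.0 * (ll(q) - ll(T1 / T))

def exact_rejection_prob(T, q, p_true, cv=3.841):
    """Exact P(LR_uc > cv) when violations follow Binomial(T, p_true)."""
    total = 0.0
    for T1 in range(T + 1):
        if kupiec_lr(T, T1, q) > cv:
            total += math.comb(T, T1) * p_true ** T1 * (1.0 - p_true) ** (T - T1)
    return total

size_100 = exact_rejection_prob(100, 0.01, 0.01)
size_250 = exact_rejection_prob(250, 0.01, 0.01)
lr_no_violations = kupiec_lr(250, 0, 0.01)
```

With T = 250 and q = 0.01, a series without any violation already gives a statistic of about 5.03, which exceeds 3.841, so the relatively likely outcome of zero violations falls into the rejection region. With T = 100 the same outcome gives about 2.01 and is not rejected. This is one plausible reading of why the type I error for 250 observations looks atypical.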

If we use the critical values obtained by simulations, we observe that the power
of the test gets even worse. Regardless of that, the results indicate that
the test based on the failure proportion is not adequate for small samples, or even for
1,000 observations. Even if we observe 40% more or fewer exceptions than we expect,
the power of the test is only about 20-25%. This is very bad news. There is a significant
probability of not rejecting the null hypothesis when it is false. The Kupiec test
should not be used for VaR models with a tolerance level of 1% for typical lengths of
the observed series.

The power of the Berkowitz test for unconditional coverage was also examined.
The results are presented in Tables 5 and 6.

Table 5 Berkowitz test (unconditional), q = 0.05, $\alpha$ = 0.05, CV based on $\chi^2$

Number     Frequency of failures
of obs.    0.030    0.035    0.040    0.045    0.050    0.055    0.060    0.065    0.070
100        0.2242   0.1490   0.0896   0.0604   0.0474   0.0590   0.1372   0.2638   0.4464
250        0.5866   0.3418   0.1732   0.0730   0.0460   0.0908   0.2412   0.5244   0.7708
500        0.9190   0.6562   0.3170   0.1094   0.0478   0.1310   0.4162   0.7812   0.9576
750        0.9864   0.8582   0.4748   0.1514   0.0480   0.1842   0.5780   0.9148   0.9920
1,000      0.9990   0.9464   0.6188   0.1792   0.0494   0.2150   0.7006   0.9666   0.9970

Table 6 Berkowitz test (unconditional), q = 0.01, $\alpha$ = 0.05, CV based on $\chi^2$

Number     Frequency of failures
of obs.    0.006    0.007    0.008    0.009    0.010    0.011    0.012    0.013    0.014
100        0.0785   0.0686   0.0575   0.0504   0.0490   0.0551   0.0655   0.0709   0.0829
250        0.1246   0.0867   0.0629   0.0511   0.0521   0.0539   0.0714   0.0941   0.1291
500        0.2281   0.1383   0.0805   0.0615   0.0487   0.0583   0.0846   0.1387   0.1967
750        0.3364   0.1948   0.0979   0.0629   0.0475   0.0683   0.1054   0.1744   0.2779
1,000      0.4469   0.2460   0.1265   0.0648   0.0500   0.0680   0.1297   0.2239   0.3553

For the tolerance level of 5% the power of the Berkowitz test is higher than that
of the Kupiec test, but only for the longer series and stronger inaccuracies could the power
of the Berkowitz test be acceptable to risk managers. For the VaR tolerance level
of 1%, the Berkowitz test again has a higher power than the Kupiec test; however,
the power of this test is, in the author's opinion, not sufficient for risk managers for
these typical lengths of data series.

We can summarize that:

• The more incorrect the VaR model, the greater the superiority of the Berkowitz test
over the Kupiec test.

• For the VaR tolerance level equal to 0.05 the superiority of the Berkowitz test can
be observed for all lengths of the return series.

• For the VaR tolerance level of 0.01 the superiority of the Berkowitz test can be
observed for series lengths of 750 and 1,000 observations.

• For the shorter series the conclusions are ambiguous.

We also examine both the Christoffersen and Berkowitz approaches for testing the
simple independence property [see (7) and (16)]. Now the frequency of failures is
correct for all series, but the exceptions are not independent. For example, for the
first case in Table 7:

$$P(I_t = 1) = 0.05 \;\wedge\; P(I_{t+1} = 1 \mid I_t = 1) = 0.025 \text{ or } 0.100. \tag{18}$$

Table 7 Results for simple Christoffersen and Berkowitz tests of independence

           Christoffersen, q = 0.05        Berkowitz, q = 0.05
Number     Freq. of fail.                  Freq. of fail.
of obs.    0.025     0.100                 0.025     0.100
100        0.0040    0.0494                0.0430    0.0524
250        0.0090    0.0998                0.0498    0.0598
500        0.0426    0.1784                0.0474    0.0660
750        0.1764    0.2474                0.0554    0.0652
1,000      0.2440    0.3012                0.0592    0.0702

           Christoffersen, q = 0.01        Berkowitz, q = 0.01
Number     Freq. of fail.                  Freq. of fail.
of obs.    0.005     0.020                 0.005     0.020
100        0.0016    0.0082                0.0508    0.0542
250        0.0076    0.0276                0.0522    0.0522
500        0.0066    0.0352                0.0516    0.0498
750        0.0096    0.0440                0.0444    0.0486
1,000      0.0084    0.0430                0.0494    0.0478



Table 8 Chosen results for mixed inadequacies of the model

           C1                  C2                  C3                  C4
Number     Berkowitz           Christoff.          B + C mixed         B + C separ.
of obs.    See (20)            See (20)            See (20)            See (20)
           0.02      0.08      0.02      0.08      0.02      0.08      0.02      0.08
100        0.0890    0.0951    0.0029    0.0334    0.0491    0.0673    0.0915    0.1261
250        0.1687    0.1696    0.0039    0.0713    0.1141    0.1628    0.1716    0.2309
500        0.3235    0.3215    0.0053    0.1206    0.2777    0.3364    0.3282    0.4087
750        0.4637    0.4823    0.0391    0.1665    0.4442    0.5035    0.4951    0.5781
1,000      0.6151    0.6079    0.1153    0.2092    0.6129    0.6330    0.6828    0.6972



Some chosen results are presented in Table 7. In the first distinguished case the
power of the test is low, and in all other cases it is unacceptably low. The
Berkowitz test has a low power as a test of exception independence, which makes it
inadequate for this range of applications. The Christoffersen test can only be used
for the distinguished tolerance level of 5%.

Finally, we examine the power of the tests when neither the unconditional coverage nor
the independence property is met. The correct model is still given by

$$P(I_t = 1) = 0.05 \;\wedge\; P(I_{t+1} = 1 \mid I_t = 1) = 0.05. \tag{19}$$

The results for the chosen case when:

$$P(I_t = 1) = 0.04 \;\wedge\; P(I_{t+1} = 1 \mid I_t = 1) = 0.02 \text{ or } 0.08 \tag{20}$$

are presented in Table 8.

Columns C1 and C2 show the results for the simple tests, column C3 the
results for the mixed test, in which both test statistics are summed, and the last
column C4 presents the results for the tests used separately, which means that a correct
model must not be rejected by either test. We observe that this is the best way of
increasing the probability of rejecting an incorrect model.

The examinations were carried out for different strengths of inaccuracy. For each
incorrect unconditional and conditional probability the conclusions based on the
obtained results are the same. Because of the limited length of this paper, tables with
other, less illustrative, results are not presented here.



5 Some Final Conclusions 

For the tolerance level of 5% the best choice is to use the Berkowitz and Christoffersen
tests separately. The results are a little better than for the mixed test and somewhat
better than for the simple Berkowitz test.

However, for the tolerance level of 1% the best choice is to use just the simple Berkowitz
test for testing unconditional coverage. In this case, testing for independence is
ineffective.

It turns out to be necessary to determine how low the power of VaR backtests
may be in some typical cases. It seems particularly important to discuss the
acceptable minimum of the test power and to focus on the type II error.

Extreme Unconditional Dependence Vs.
Multivariate GARCH Effect in the Analysis
of Dependence Between High Losses on Polish
and German Stock Indexes



Pawel Rokita 



Abstract Classical portfolio diversification methods do not take account of any
dependence between extreme returns (losses). Many researchers provide, however,
some empirical evidence for various assets that extreme losses co-occur. If the
co-occurrence is frequent enough to be statistically significant, it may seriously
influence portfolio risk. Such effects may result from a few different properties
of financial time series, for instance: (1) extreme dependence in a (long-term)
unconditional distribution, (2) extreme dependence in subsequent conditional
distributions, (3) time-varying conditional covariance, (4) time-varying (long-term)
unconditional covariance, (5) market contagion. Moreover, a mix of these properties
may be present in return time series. Modeling each of them requires a different
approach. It seems reasonable to investigate whether distinguishing between the
properties is highly significant for portfolio risk measurement. If it is, identifying
the effect responsible for high loss co-occurrence would be of great importance. If
it is not, the best solution would be to select the easiest-to-apply model. This article
concentrates on two of the aforementioned properties: extreme dependence (in a
long-term unconditional distribution) and time-varying conditional covariance.

Keywords Extreme dependence · Multivariate GARCH · TDC.



1 Introduction 

This paper addresses the problem of high loss co-occurrence. From a risk manager’s 
point of view, if such phenomenon exists in an analyzed portfolio, it requires careful 
treatment. 

From among the various approaches that may describe some forms of high loss
co-occurrence, two very different groups of models receive particular attention here:
(long-term) unconditional distributions with extreme dependence, and heteroscedastic
multivariate stochastic processes whose dependence structures are described with
classical covariances.

P. Rokita
Department of Financial Investment and Risk Management, Wroclaw University of Economics, ul.
Komandorska 118/120, 53-345 Wroclaw, Poland,
e-mail: pawel.rokita@ue.wroc.pl

A. Fink et al., (eds.). Advances in Data Analysis, Data Handling and Business
Intelligence, Studies in Classification, Data Analysis, and Knowledge Organization,
DOI 10.1007/978-3-642-01044-6-45, © Springer-Verlag Berlin Heidelberg 2010

The first group is represented in this paper by distributions with fat-tailed 
Archimedean copulas and normal margins. The second - by a multivariate GARCH 
model. 

The aim of the research is not to determine which of the two aforementioned 
groups of models better fits the data, but rather to compare their influence on risk 
estimates. 

Since the models originate from completely different approaches, no goodness of 
fit measures are used in the comparison. It seems more promising to estimate Value 
at Risk assuming different underlying models, and then compare back-test results. 
Particular attention should be paid to the cases when the assumed theoretical VaR 
model is simpler than the model from which the pseudo-random sample has been 
generated (esp. if it depends on lower number of parameters). 

Static approaches, even if they assume dependence of extreme values, are usually
simpler than dynamic models of non-stationary stochastic processes. In particular,
employing Archimedean copulas, which usually depend on one parameter (plus the
parameters of the marginal distributions), gives a conveniently small set of parameters
to be estimated. For comparison, modeling a two-dimensional BEKK(1,1)
process requires the estimation of 11 parameters. A full VEC-GARCH(1,1) would
have as many as 21 parameters. Thus, any observation implying that unconditional
tail dependence may be more important than the dynamics of the dependence structure
would be good news for a risk modeler trying to simplify VaR estimation
procedures.
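The parameter counts quoted above can be verified with a short calculation. As a sketch, assume the usual parameterizations: for BEKK(1,1), a lower-triangular constant matrix and two full $N \times N$ coefficient matrices; for VEC-GARCH(1,1), a constant vector and two full square matrices acting on the $N(N+1)/2$ distinct elements of the conditional covariance matrix. For $N = 2$:

$$
\begin{aligned}
\text{BEKK(1,1):}\quad & \frac{N(N+1)}{2} + 2N^2 = 3 + 2 \cdot 4 = 11, \\
\text{VEC-GARCH(1,1):}\quad & m + 2m^2 = 3 + 2 \cdot 9 = 21, \qquad m = \frac{N(N+1)}{2} = 3.
\end{aligned}
$$

Both results agree with the numbers given in the text, whereas a one-parameter Archimedean copula with two normal margins needs only five parameters (one for the copula and a mean and a variance for each margin).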



2 Models to Be Compared 

2.1 Extreme Dependence 

2.1.1 Testing For 

The term "extreme dependence" will be used here interchangeably with "tail
dependence" and understood as a dependence between extreme returns (losses).
The approach of Coles, Heffernan and Tawn (Coles, Heffernan, & Tawn, 1999,
p. 348) has been applied here. It also employs a tail independence test, utilizing the
notion of asymptotic dependence.

Asymptotic dependence, in the sense of the definition of Sibuya and Joe (Sibuya, 1960; Joe,
1997), later proposed for the needs of financial data analysis by Frahm,
Junker, and Schmidt (2005, p. 2), occurs if the following conditional probability
exists in the limit and is higher than zero:

Definition 1. Asymptotic dependence

$$\lambda_u = \lim_{p_u \to 1^-} P\left(X_1 > F_1^{-1}(p_u) \mid X_2 > F_2^{-1}(p_u)\right) > 0 \tag{1}$$

The quantity $\lambda_u$ is called the tail dependence coefficient (TDC) or, to be more
precise here, the upper tail dependence coefficient (uTDC).

It is worth emphasizing that the condition of asymptotic dependence in the tail of
a multivariate distribution is stronger than just tail dependence (dependence in the
tail). Thus, asymptotic independence does not imply tail independence.

Random variables $X_1$ and $X_2$ are asymptotically independent (in the upper tail)
when:

$$\lambda_u = \lim_{p_u \to 1^-} P\left(X_1 > F_1^{-1}(p_u) \mid X_2 > F_2^{-1}(p_u)\right) = 0 \tag{2}$$

whereas they would be tail independent if:

$$P\left(X_1 > F_1^{-1}(p_u),\, X_2 > F_2^{-1}(p_u)\right) = P\left(X_1 > F_1^{-1}(p_u)\right) \times P\left(X_2 > F_2^{-1}(p_u)\right). \tag{3}$$



Asymptotically independent random variables may be more or less tail dependent.
Therefore, testing for $\lambda_u$ (TDC) being equal to zero is insufficient to verify the
tail independence hypothesis. For that reason Coles, Heffernan and Tawn introduced
a complementary dependence measure of the following form:

$$\bar{\lambda}_u = \lim_{p_u \to 1^-} \frac{2 \log P\left(X_1 > F_1^{-1}(p_u)\right)}{\log P\left(X_1 > F_1^{-1}(p_u),\, X_2 > F_2^{-1}(p_u)\right)} - 1$$



Table 1 presents the way the parameters $\lambda_u$ and $\bar{\lambda}_u$ may be jointly interpreted.

The pair ($\lambda_u$, $\bar{\lambda}_u$) may inform, in two particular cases, about:

• Asymptotic dependence, when ($\lambda_u > 0$, $\bar{\lambda}_u = 1$) - then $\lambda_u$ measures the strength
of the dependence, while $\bar{\lambda}_u$ does not bring any additional information.

• Asymptotic independence, if ($\lambda_u = 0$, $\bar{\lambda}_u < 1$) - then $\bar{\lambda}_u$ measures the strength
of the dependence, while $\lambda_u$ does not bring any additional information.

Table 1 Joint interpretation of parameters $\lambda_u$ and $\bar{\lambda}_u$

Strength and direction        $\bar{\lambda}_u$                $\lambda_u$              Asymptotic dependence
of dependence^a
Independence                  0                          0                      None
Positive dependence           0 < $\bar{\lambda}_u$ < 1         0 < $\lambda_u$ < 1       May occur, but does not have to
Negative dependence           -1 < $\bar{\lambda}_u$ < 0        0 < $\lambda_u$ < 1       May occur, but does not have to
Full dependence (positive     -1 or 1                    0 < $\lambda_u$ < 1       Occurs
or negative)

^a Refers to values exceeding threshold quantiles
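For a finite sample the two measures can only be evaluated at a fixed threshold quantile below 1. The Python sketch below computes such finite-threshold counterparts of $\lambda_u$ and $\bar{\lambda}_u$ from empirical exceedance frequencies. It is a naive illustration, not one of the estimators used in this research; the function name, the threshold level 0.9 and the simulated data are assumptions.

```python
import math
import random

def empirical_tail_measures(x1, x2, p):
    """Finite-threshold analogues of lambda_u and lambda_bar_u at level p.

    Uses the empirical p-quantiles of x1 and x2 as thresholds and
    replaces probabilities by relative frequencies of exceedances.
    """
    n = len(x1)
    k = int(p * n)
    t1 = sorted(x1)[k]                      # empirical quantile of x1
    t2 = sorted(x2)[k]                      # empirical quantile of x2
    n1 = sum(1 for a in x1 if a > t1)       # exceedances of x1
    n2 = sum(1 for b in x2 if b > t2)       # exceedances of x2
    n12 = sum(1 for a, b in zip(x1, x2) if a > t1 and b > t2)
    if n1 == 0 or n2 == 0 or n12 == 0:
        raise ValueError("too few exceedances at this threshold")
    lam = n12 / n2                          # P(X1 > t1 | X2 > t2)
    lam_bar = 2.0 * math.log(n1 / n) / math.log(n12 / n) - 1.0
    return lam, lam_bar

rng = random.Random(0)
a = [rng.gauss(0.0, 1.0) for _ in range(20000)]
b = [rng.gauss(0.0, 1.0) for _ in range(20000)]

lam_indep, lam_bar_indep = empirical_tail_measures(a, b, 0.9)   # independent
lam_full, lam_bar_full = empirical_tail_measures(a, a, 0.9)     # identical
```

For independent series the second measure is close to 0 and the first is close to 1 - p, while for identical series both equal 1, in line with the first and last rows of Table 1.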

The tail independence test in fact consists in verifying the hypothesis $H_0: \bar{\lambda}_u = 1$, not
the hypothesis $H_0': \lambda_u = 0$. Thus, if there are no grounds for rejecting the
hypothesis $H_0$, there are also no grounds for rejecting asymptotic dependence. On the
other hand, if the hypothesis $H_0: \bar{\lambda}_u = 1$ is rejected in favor of the alternative hypothesis
$H_1: \bar{\lambda}_u < 1$, it is interpreted that asymptotic dependence may be nonexistent,
though this does not automatically mean that the hypothesis $H_0': \lambda_u = 0$ of
asymptotic independence may be accepted. However, because under $H_0: \bar{\lambda}_u = 1$ the
parameter $\bar{\lambda}_u$ takes on the upper limit of its admissible interval, testing for it consists
in the estimation of a tail shape parameter of a special aggregated univariate random
variable, called by Coles and Tawn a structural variable T (Coles & Tawn, 1994,
p. 23). If its univariate distribution is fat-tailed, then the bivariate distribution of the
random variables used to construct it is also fat-tailed. The tail of the random variable
T is modeled using the generalized Pareto distribution (GPD). The estimated tail parameter
is just the shape parameter $\xi$ of the GPD. If its value is lower than 1, then $\bar{\lambda}_u$
is also lower than 1. The advantage of estimating $\xi$ and drawing conclusions about
$\bar{\lambda}_u$ on this basis is that the domain of the former is just the set of real numbers. Thus,
determining a confidence interval for $\xi$ estimates is much simpler (it may be estimated
using the ML method and is asymptotically normally distributed).



2.1.2 Models Used for Simulations 

To generate pseudo-random numbers showing extreme dependence, models with 
Archimedean copulas (Nelsen, 2005, p. 2) are used here. Archimedean copula func- 
tions have been chosen because of their analytic simplicity. Moreover, some of 
them possess the property of tail dependence (see Armstrong, 2003). Copulas of 
the following numbers according to Nelsen’s classification are utilized: 4 (so called 
Gumbel copula), 12 and 14. 
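The tail dependence property of the Gumbel copula can be checked numerically. For a bivariate copula C, the conditional exceedance probability at level p equals (1 - 2p + C(p, p)) / (1 - p), and for the Gumbel copula with parameter $\theta \ge 1$ its limit as p tends to 1 is $2 - 2^{1/\theta}$. The Python sketch below compares the two; the parameter value $\theta = 2$ and the function names are chosen for illustration only.

```python
import math

def gumbel_copula(u, v, theta):
    """Gumbel copula C(u, v) (no. 4 in Nelsen's list), theta >= 1."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def upper_tail_ratio(copula, p):
    """P(U > p | V > p) = (1 - 2p + C(p, p)) / (1 - p)."""
    return (1.0 - 2.0 * p + copula(p, p)) / (1.0 - p)

theta = 2.0   # hypothetical parameter value
approx = upper_tail_ratio(lambda u, v: gumbel_copula(u, v, theta), 0.9999)
closed_form = 2.0 - 2.0 ** (1.0 / theta)

# With theta = 1 the Gumbel copula reduces to independence, C(u, v) = u * v,
# and the ratio equals 1 - p, which vanishes in the limit.
indep = upper_tail_ratio(lambda u, v: gumbel_copula(u, v, 1.0), 0.9999)
```

The upper tail dependence coefficient is therefore positive for every $\theta > 1$ and equals zero only in the independence case $\theta = 1$.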

It should be emphasized that this choice does not limit the generality of the further
investigation with empirical data. The asymptotic dependence tests, as well as three out of the four
TDC estimators that are used in the research (Sect. 3), do not assume any specific
parametric form of the dependence structures.



2.2 Varying Conditional Covariance 

Here, a model with varying conditional covariance is understood as a stochastic
process of returns whose conditional covariance matrix ($H_t$) follows a multivariate
GARCH. In general, a multivariate GARCH model may be described by Bollerslev,
Engle and Wooldridge's VEC-GARCH (Bollerslev, Engle, & Wooldridge, 1998).
Due to its practical intractability (a vast number of parameters and problems with
fulfilling the condition that $H_t$ is positive definite), one of its most popular
simplifications, BEKK (the Baba-Engle-Kraft-Kroner model), will be used here (Engle &
Kroner, 1995). BEKK model:

$$H_t = C C^{\top} + \sum_{j=1}^{q} \sum_{k=1}^{K} A_{jk}^{\top}\, \varepsilon_{t-j} \varepsilon_{t-j}^{\top}\, A_{jk} + \sum_{i=1}^{p} \sum_{k=1}^{K} B_{ik}^{\top}\, H_{t-i}\, B_{ik}$$

For the needs of the analyses presented in this paper the BEKK(1,1) model will be
used.

The method of verifying the M-GARCH(1,1) property (in its BEKK(1,1) version)
adopted in this paper consists in estimating the model parameters and evaluating
their significance. It is assumed that the M-GARCH(1,1) property may be deemed
to exist if the parameters ($a_{nm}$) and ($b_{nm}$) are significantly different from zero.

As far as the simulations are concerned, pseudo-random data with the
M-GARCH(1,1) property were simulated using just the BEKK(1,1) model.
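A bivariate BEKK(1,1) process can be simulated with a simple recursion. The Python sketch below is not the simulation code used in this research. It assumes the common form $H_t = C C^{\top} + A^{\top} \varepsilon_{t-1} \varepsilon_{t-1}^{\top} A + B^{\top} H_{t-1} B$ with Gaussian innovations, starts the recursion from $C C^{\top}$, and uses hypothetical parameter values chosen only so that the process is well behaved.

```python
import math
import random

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

def add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def bekk11_step(C, A, B, eps_prev, H_prev):
    """One step of the assumed recursion H_t = CC' + A'ee'A + B'H_{t-1}B."""
    outer = [[eps_prev[i] * eps_prev[j] for j in range(2)] for i in range(2)]
    arch = matmul(matmul(transpose(A), outer), A)
    garch = matmul(matmul(transpose(B), H_prev), B)
    return add(add(matmul(C, transpose(C)), arch), garch)

def draw_innovation(H, rng):
    """Draw eps = L z, where H = L L' (2x2 Cholesky) and z is standard normal."""
    l11 = math.sqrt(H[0][0])
    l21 = H[1][0] / l11
    l22 = math.sqrt(H[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [l11 * z1, l21 * z1 + l22 * z2]

def simulate_bekk11(C, A, B, n, seed=0):
    rng = random.Random(seed)
    H = matmul(C, transpose(C))          # starting value: an assumption
    eps = draw_innovation(H, rng)
    path, covs = [], []
    for _ in range(n):
        H = bekk11_step(C, A, B, eps, H)
        eps = draw_innovation(H, rng)
        path.append(eps)
        covs.append(H)
    return path, covs

# Hypothetical parameter values (3 + 4 + 4 = 11 parameters in total)
C = [[0.3, 0.0], [0.1, 0.3]]
A = [[0.3, 0.05], [0.05, 0.3]]
B = [[0.9, 0.0], [0.0, 0.9]]
returns, covs = simulate_bekk11(C, A, B, 500)
```

Because every term of the recursion is a quadratic form, each simulated $H_t$ is positive definite by construction, which is the practical advantage of BEKK over the unrestricted VEC form mentioned above.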

No higher orders for the BEKK model than (1,1) will be considered. Also no 
other classes of varying conditional covariance, such as multivariate stochastic 
volatility, will be analyzed. 



3 The Research 

At the first stage, two-dimensional pseudo-random samples were generated from the 
following models: 

(1) static (random variable), with a constant distribution showing tail dependence, 
using Archimedean copulas of numbers 4, 12 and 14, according to Nelsen’s 
classification 

(2) dynamic (stochastic process), with variable conditional covariance, using 
BEKK(1,1) 

In the simulation study empirical data sets were used only to estimate model
parameters. In this part of the research no attempt was made to check whether the assumed
models really fitted the empirical data. The concept of the analysis was to test
Value-at-Risk model performance when pseudo-random loss sequences show different
statistical characteristics.

This part of the research consists of three steps.

First, it is shown that data generated from BEKK(1,1) may display the property of tail
dependence, as far as the unconditional distribution is concerned (see Table 2).

Then, a VaR estimation method assuming tail dependence is applied to the data
coming from various models. Table 3 presents the results.

It is shown that the model should be rejected for data following the M-GARCH
process, whereas there are no grounds to reject it if the data come from multivariate
fat-tailed distributions, even if they differ from the one assumed in the VaR model.

Table 3 Example 1 - back-test results - data: pseudo-random, VaR model: with Gumbel copula (no. 4 according to Nelsen)

Model assumed for VaR estimation: copula no. 4 (Gumbel)

Test significance level: 5%    VaR tolerance level: 1%    Test sample size: 1,000

$LR_{ind}$    0.129    3.841    0        $LR_{ind}$    0.164    3.841

$LR_{mix}$    0.563    5.991    0        $LR_{mix}$    0.268    5.991

Source: Rokita (2007)

In this particular example Archimedean copula no. 4 (Gumbel) was assumed for
VaR estimation, but the same conclusions were drawn when it was replaced with
copula no. 12 or 14 (all three having the property of tail dependence).

Finally, BEKK(1,1) is used as the VaR-underlying model and back-tested with the
data coming from various models (see Table 4).

This VaR model is not rejected for any data set. Due to the relatively small number
of observations left for the test part of the samples in the last case, the problem of
the test power may require some more attention (see, e.g., Piontek, 2008). On the
other hand, testing the last 250 days is in compliance with banking practice.

In the second stage, real market data sets are used. Logarithmic losses from the
WIG20 and DAX30 indexes are tested for the existence of asymptotic dependence
(static model) and of the M-GARCH effect. First, the whole sample is taken into
consideration. In the next step, the period the data come from is divided into 4
sub-periods of equal length.

The motivation for analyzing data from different sub-periods was a prior belief
that the assumed static model might be inadequate here. If the test results differed
from period to period, this would be a sufficient premise to state that the property
of tail dependence or independence is not time-invariant for these time series. That
would also reject the whole concept of a static model, whether it assumed
asymptotic dependence or independence.

Tests of asymptotic independence were performed using the approach outlined in
Sect. 2.1. Table 5 shows the results for the whole sample and the sub-samples.

For sub-period 2, TDC estimators were also calculated. For the other sub-samples,
estimating this parameter would not make sense. For more details about TDC
estimation refer to Schmidt and Stadtmüller (2003).

As can be seen from Table 5, the results of the asymptotic independence tests are
inconsistent in the sense that they change as the sample period changes. This may
indicate the aforementioned problem of time-varying fatness of the joint tail. Other
issues are the power of the test and the quality of the TDC estimators used here.
These questions, however, go beyond the subject matter of this paper, although it
seems reasonable to return to them in future research.

Tests of the M-GARCH effect were performed using the approach proposed in
Sect. 2.2. For comparability, both the whole sample and the 4 sub-samples were
analyzed in this case as well. The results are presented in Table 6.

For neither whole sample nor sub-periods was the BEKK(1,1) rejected. 

Finally, three models with Archimedean copulas and then the BEKK model were used
for VaR estimation with regard to the empirical data. Daily 1% VaR estimates
were back-tested. The length of the whole sample was 2,181 days, and the length of
the test sample was 1,000 days. Each VaR forecast was estimated using a training
sample spanning the preceding 1,181 days. Results of the back-tests are shown
in Table 7. They are less clear-cut than the tests for the existence of the multivariate
GARCH effect in the data (Table 6). A model with the Archimedean copula no. 12
seems to perform best. Copulas no. 4 and 14 are rejected due to overestimating risk.
BEKK was not rejected by any of the Christoffersen tests, nor by the Kupiec test of



Table 4 Example 2 - back-test results - data: pseudo-random, VaR model: BEKK
Model assumed for VaR estimation: BEKK 

Test significance level: 5% VaR tolerance level: 1% Test sample size: 1,000 
Table 6 LR test for M-GARCH effect (for WIG20 and DAX30)

Pair: (WIG20, DAX30)   Test significance level: 0.05

Sample        Reject    BEKK parameters                        LRT statist.    LRT crit.
              M-GARCH                                                          value
Whole sample  0          0.0016   0.0007   0.0011              367602.77       15.51
                         0.1798   0.0148   0.0114   0.2558
                         0.9761  -0.0037  -0.0021   0.9610

Sub-sample    Reject    BEKK parameters                        LRT statist.    LRT crit.
no. (of 4)    M-GARCH                                                          value
1             0          0.0038   0.0009   0.0000              10,219 X 10’    15.51
                         0.2911   0.1267  -0.2141   0.0897
                         0.8780  -0.1026   0.1498   1.0270
2             0          0.0018  -0.0007   0.0025              63,388.515      15.51
                         0.1690  -0.0142   0.0071   0.2855
                         0.9763   0.0213  -0.0007   0.9449
3             0          0.0031   0.0009   0.0005              111,058.11      15.51
                         0.1084   0.1417  -0.2027   0.0720
                         0.8858  -0.1055   0.1624   1.0048
4             0          0.0043   0.0019   0.0000              29,644 X 10®    15.51
                        -0.0848  -0.1427   0.1743   0.4151
                         0.8742  -0.1013   0.1603   0.9942

unconditional number of exceedances, but it was rejected by the Kupiec test of
exceedance independence. Thus, if only VaR back-tests were taken into consideration,
an unconditional multivariate distribution with copula no. 12 would be selected
as the VaR model. However, the research presented earlier in this section indicates
that if the M-GARCH effect is present in the data, it should also be taken into
consideration when estimating Value at Risk.
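
The back-test of the unconditional number of exceedances can be illustrated with a short sketch. The Python snippet below is not the authors' implementation; it assumes the standard proportion-of-failures likelihood-ratio form of the Kupiec test. The test sample size (1,000), the VaR tolerance level (1%) and the critical value 3.841 are taken from the tables above, whereas the exceedance counts are hypothetical.

```python
import math


def kupiec_pof(n_obs: int, n_exceed: int, p: float) -> float:
    """LR statistic of the Kupiec proportion-of-failures test (assumed form)."""

    def loglik(prob: float) -> float:
        # Binomial log-likelihood without the constant binomial coefficient.
        # The convention 0 * log(0) = 0 covers zero or all exceedances.
        ll = 0.0
        if n_exceed > 0:
            ll += n_exceed * math.log(prob)
        if n_obs - n_exceed > 0:
            ll += (n_obs - n_exceed) * math.log(1.0 - prob)
        return ll

    p_hat = n_exceed / n_obs  # observed exceedance frequency
    return -2.0 * (loglik(p) - loglik(p_hat))


CHI2_1DF_5PCT = 3.841  # critical value as quoted in Table 3

# Hypothetical exceedance counts in a test sample of 1,000 days at a 1% level.
for exceedances in (10, 12, 20):
    lr = kupiec_pof(1000, exceedances, 0.01)
    decision = "reject" if lr > CHI2_1DF_5PCT else "do not reject"
    print(exceedances, round(lr, 3), decision)
```

The statistic is zero when the observed exceedance frequency equals the tolerance level and grows as the two diverge in either direction, which is why both overestimating and underestimating risk can lead to rejection.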



4 Summary 

In the simulation part of the research it was observed that Value at Risk mod- 
els with time-invariant fat tailed multivariate distributions proved to perform well 
only for the data showing similar tail dependence properties (though not necessarily 
coming from identical models). They were rejected for the data with time-varying 
conditional covariance. 

In the research discussed herein, real market data were used twice: first, only for
estimating the parameters used in the simulation study; then, the empirical sample
was tested for the existence of statistical phenomena that would be in compliance with
either the first or the second group of models. The conclusions drawn from the
tests are that a static tail-dependence model is rejected and there are no premises to
reject multivariate GARCH. Thus, considering the results of the previous simulations,



Table 7 Back-test results - data: empirical (WIG20, DAX30), VaR models: Archim. copulae: 4, 12, 14 and M-GARCH (BEKK) 
Data: empirical (WIG20, DAX30) 
any VaR model applied to the data analyzed here must take at least time-varying
covariance into account and does not need to allow for modeling tail dependence.

In further research it would be also advisable to compare - in respect of their 
influence on portfolio risk - some other properties of financial time series that may 
result in high-loss co-occurrence (e.g., extreme dependence in conditional distribu- 
tions of a return process, varying unconditional covariance of a return process, and 
market contagion phenomenon). 
Is Log Ratio a Good Value for Measuring
Return in Stock Investments?



Alfred Ultsch 



Abstract Measuring the rate of return is an important issue for the theory and practice
of investments in the stock market. A common measure for the rate of return
is the logarithm of the ratio of successive prices (LogRatio). In this paper it is
shown that LogRatio as well as the arithmetic return rate (Ratio) have several
disadvantages. As an alternative, relative differences (RelDiff) are proposed to measure
return. The stability of RelDiff against numerical and rounding errors is much better
than that of LogRatio and Ratio. RelDiff values are nearly identical to LogRatio and
Ratio values for small absolute returns. The usage of RelDiff maps returns to a finite
range. For most subsequent analyses this is a big advantage. The usefulness of the
approach is demonstrated on daily return rates of a large set of actual stocks. It is
shown that returns can be modeled with great precision by a very simple mixture of
distributions using relative differences.

Keywords Log ratio • Stock market. 



1 Introduction 

The daily rate of return of a stock is an important figure, not only for practitioners
who want to see how their portfolio performs, but also in many theories of
market risk. A model of the distribution of daily stock returns is a prerequisite
for many theories. For example, the Black and Scholes formula for options (Black
& Scholes, 1973) relies on the assumption that daily returns are log-normally
distributed. Markowitz portfolio theory is built on the assumption that returns follow a
Gaussian normal distribution (Markowitz, 1952). It is known, however, that actual
returns measured in the market do not follow these model distributions (Aas, 2004;
Nawroth & Peinke, 2006). Return may be calculated using different formulas. The
most commonly used are the arithmetic return (Ratio) and LogRatio. Both measures have
the disadvantage of an asymmetric and unbounded range. In this paper relative
differences are introduced to measure returns. It is shown that this measure leads to a
simple but very precise model of actually observed returns. The approach is tested on
a large database of daily stock prices.



2 The Data 

In this paper data from stock markets in the US were used. Daily stock prices for the
period from January 1, 2000 until March 1, 2008 were extracted from a public source
(Yahoo Finance). The stock prices are the daily closing prices adjusted for splits
in the past. This gave 2,047 days of (adjusted) closing prices. During this period the
Standard and Poor’s 500 Index (S&P500) had a range between 776 and 1,566. The
S&P500 showed periods of positive and negative slope of about the same length. In
order to avoid effects from stocks with very small prices (penny stocks) and very
high-priced stocks, only stocks whose price was in the interval from $2 to $25 during
99% of the time period were used. A set of 7,030 stocks fulfilled this
condition. Overall, the following empirical results are based on 7,030 stocks × 2,047
prices. This is more than 14 million numbers.



3 Measuring Daily Return 



A straightforward measurement of return is the Ratio R (also known as arithmetic
return):

$$R = \frac{P(\text{today}) - P(\text{yesterday})}{P(\text{yesterday})} \qquad (1)$$

where P(d) is the closing price of a stock at day d. This is, however, not the only
way to measure a stock’s performance. The next common measure is LogRatio LR
(Aas, 2004):

$$LR = \ln \frac{P(\text{today})}{P(\text{yesterday})}$$

LogRatios have the advantage that returns of longer periods can be simply calculated
by summing the LogRatios of the intermediate periods. Furthermore, if stock pricing
is assumed to be a time-continuous process, LogRatios are the infinitesimal limit
of the arithmetic returns. Equation (1) can be rewritten to

$$R + 1 = \frac{P(\text{today})}{P(\text{yesterday})}$$

Using a Taylor series approximation for ln(R + 1) gives ln(R + 1) ≈ R for small
R. Figure 1 compares Ratio and LogRatio for the data described above. It can be
Fig. 1 Ratio and LogRatio for empirical data (horizontal axis: Ratio [%])



seen that for |R| < 10% Ratio and LogRatio give almost the same value. Figure 1
also shows one of the problems of Ratio and LogRatio: the range is not symmetrical
for positive and negative values. Positive Ratios (gains) higher than 1,000%
were observed in practice. Total loss (ruin) is, however, limited to -100% for Ratio.
For LogRatios this problem is worse. If the price of today or yesterday is close to
zero, LogRatio is numerically unstable and tends towards infinity. For a model, this
has the consequence that gains and losses must be described differently.
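
The relation between the two measures can be checked numerically. The following Python sketch uses a hypothetical price series (not taken from the data set) to show that the LogRatios of consecutive days add up to the LogRatio of the whole period and that Ratio and LogRatio diverge for large returns; the function names are chosen for illustration only.

```python
import math


def ratio(p_today: float, p_yesterday: float) -> float:
    """Arithmetic return R as in Eq. (1)."""
    return (p_today - p_yesterday) / p_yesterday


def log_ratio(p_today: float, p_yesterday: float) -> float:
    """LogRatio LR = ln(P(today) / P(yesterday))."""
    return math.log(p_today / p_yesterday)


# Hypothetical closing prices of one stock on four consecutive days.
prices = [10.0, 10.5, 9.8, 10.2]

daily = [log_ratio(b, a) for a, b in zip(prices, prices[1:])]
whole_period = log_ratio(prices[-1], prices[0])
# The daily LogRatios sum to the LogRatio of the whole period.
print(sum(daily), whole_period)

# ln(R + 1) is close to R only for small R.
for r in (0.01, 0.10, 1.0, 9.0):
    print(f"Ratio {r:6.2f}  LogRatio {math.log(1.0 + r):7.4f}")

# Total loss gives Ratio = -1 (i.e., -100%), while LogRatio is undefined.
print(ratio(0.0, 10.0))
```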

The unconfined range of both measures has another nasty consequence for small
portfolios. Consider for example a small portfolio of three stocks. On one day d_1
these stocks may have LogRatios (or Ratios) of {5%, 8%, 600%}, on another day
d_2 LogRatios of {5%, 8%, 6%}. While for d_2 the calculation of an average return
(6.33%) makes sense, the same calculation is biased by the extreme gain of the third
stock on d_1. The same holds for an estimation of the variances. In particular, if a
comparison of days in terms of returns is wanted, one might want to use the sum of
differences (or Euclidean distance) between the days. This is also extremely biased by
the unbounded gain.
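
The effect of the single extreme gain in this example can be reproduced in a few lines. The Python sketch below uses exactly the two return sets given above and only standard-library functions; it is meant as an illustration, not as part of the original analysis.

```python
import math
from statistics import mean, pvariance

# Returns in percent for the three stocks on the two days of the example.
day1 = [5.0, 8.0, 600.0]
day2 = [5.0, 8.0, 6.0]

# The average of day 1 is dominated by the third stock.
print(round(mean(day1), 2), round(mean(day2), 2))

# The same holds for the variance ...
print(round(pvariance(day1), 2), round(pvariance(day2), 2))

# ... and for the Euclidean distance between the two days, which is
# determined entirely by the one stock that differs.
print(math.dist(day1, day2))
```

Although two of the three stocks behave identically on both days, every summary statistic is governed by the third one.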

A different definition of returns in the form of Relative Differences (RelDiff)
alleviates these problems. Define

$$\text{RelDiff} = \frac{P(\text{today}) - P(\text{yesterday})}{\frac{1}{2}\left(P(\text{today}) + P(\text{yesterday})\right)} = \frac{2\left(P(\text{today}) - P(\text{yesterday})\right)}{P(\text{today}) + P(\text{yesterday})} \qquad (2)$$

RelDiff compares gains and losses to the average price of both days,
yesterday and today. For the ranges |Ratio| < 25% and |LogRatio| < 60% RelDiff
is numerically almost identical to Ratio and LogRatio, respectively (see Fig. 2).
Fig. 2 Comparison of LogRatio and RelDiff for the empirical data (horizontal axis: LogRatio [%])



Ruin is not a numerical problem for RelDiff: P(today) = 0 in (2) gives a value
of -200%. For extreme gains P(today) - P(yesterday) ≈ P(today) + P(yesterday), so
(2) results in RelDiff ≈ 200%. This means RelDiff has a symmetrical and limited
range of -200% ≤ RelDiff < 200%.
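
The bounded range can be verified with a small sketch of definition (2). The two conversion helpers in the Python snippet below are not part of the paper; they follow algebraically from (1) and (2) by substituting P(today) = (1 + R) P(yesterday) and P(today)/P(yesterday) = exp(LR), and their names are chosen for illustration.

```python
import math


def rel_diff(p_today: float, p_yesterday: float) -> float:
    """RelDiff as in Eq. (2): price difference relative to the average price."""
    return 2.0 * (p_today - p_yesterday) / (p_today + p_yesterday)


def rel_diff_from_ratio(r: float) -> float:
    """RelDiff expressed through the Ratio R (derived from Eqs. (1) and (2))."""
    return 2.0 * r / (2.0 + r)


def rel_diff_from_log_ratio(lr: float) -> float:
    """RelDiff expressed through the LogRatio LR (derived identity)."""
    return 2.0 * math.tanh(lr / 2.0)


print(rel_diff(0.0, 10.0))        # ruin: lower end of the range
print(rel_diff(1e9, 1.0))         # extreme gain: close to the upper end
print(rel_diff_from_ratio(6.0))   # the 600% gain of the portfolio example

# For small returns all three measures nearly coincide.
p_yesterday, p_today = 10.0, 10.3
print((p_today - p_yesterday) / p_yesterday,
      math.log(p_today / p_yesterday),
      rel_diff(p_today, p_yesterday))
```

In this sketch the 600% gain from the small-portfolio example maps to a RelDiff of 150%, so a single extreme stock can no longer dominate averages or distances to the same extent.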



4 The Distribution of Daily Returns 

A model of the distribution of daily returns is important for theoretical and practical
purposes. Markowitz’ theory assumes a normal distribution of Ratios. Option pricing
according to Black & Scholes (1973) assumes a log-normal distribution of
price ratios, which is equivalent to a normality assumption for LogRatios. It is well
known, however, that both assumptions do not hold in practice. Figure 3 shows a
quantile/quantile (Q/Q) plot of Ratios, LogRatios and RelDiffs.

It can be seen that the normality assumption is appropriate for small absolute
returns. This holds for about 90-95% of the data. For larger gains or losses the
distributions are leptokurtic. RelDiff with its limited range has the best potential for
a precise model.



5 Modeling the Distribution of Returns 

The suitability of the different measures for return was tested via a model of the 
distributions. To address the leptokurtic nature of the distribution a central Gaussian 
plus a LogNormal distribution at each side (gains and losses) was used. These mix- 
tures of Log-Gauss-Log were optimized for Ratio, LogRatio and RelDiff using the 
Fig. 4 Q/Q plot of winner returns w.r.t. LogNormal compared to a linear fit



EM algorithm (Bilmes, 1997). As a quality measure, the linearity of a Q/Q plot vs. the
model distribution is used. For the central Gaussian all three measures showed
complete linearity. This can be seen in Fig. 3. For losses, the Q/Q plots of LogRatio
and RelDiff showed a good linear relation. Ratio measures for losses did not show a
linear relation to the LogNormal distribution (see Fig. 3). For the gain side of returns
the situation is as shown in Fig. 4. A linear function has been fitted for LogRatio
and RelDiff in Fig. 4.
For the Ratio measure a linear fit is not appropriate. This indicates
that a Log-Gauss-Log model is not suitable for this measure. For LogRatio
and RelDiff the linear model is appropriate for values up to about 15%. For larger
returns only RelDiff can be reasonably modeled with this type of distribution.
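
The idea of judging a model by the linearity of its Q/Q plot can be sketched as follows. The paper does not state how linearity was quantified, so the Python snippet below makes an assumption: it uses the Pearson correlation between the sorted sample and the quantiles of a standard normal distribution as a simple linearity score. The two samples are synthetic and purely illustrative; for a LogNormal component the same score could be applied to the logarithms of the return magnitudes.

```python
import math
import random
from statistics import NormalDist


def qq_correlation(sample):
    """Pearson correlation between sorted data and standard normal quantiles.

    Values close to 1 indicate an almost linear Q/Q plot (assumed score).
    """
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n lie strictly inside (0, 1).
    qs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_x = sum(xs) / n
    mean_q = sum(qs) / n
    cov = sum((x - mean_x) * (q - mean_q) for x, q in zip(xs, qs))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_q = sum((q - mean_q) ** 2 for q in qs)
    return cov / math.sqrt(var_x * var_q)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# Synthetic returns: purely Gaussian fluctuations ...
gauss_like = [rng.gauss(0.0, 0.02) for _ in range(2000)]
# ... and Gaussian fluctuations with occasional tenfold larger moves.
heavy_tailed = [
    rng.gauss(0.0, 0.02) * (10.0 if rng.random() < 0.05 else 1.0)
    for _ in range(2000)
]

print(qq_correlation(gauss_like))
print(qq_correlation(heavy_tailed))
```

The leptokurtic sample yields a visibly lower score, which corresponds to the bending of the Q/Q plot away from a straight line for large gains and losses.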



6 Discussion 

Using relative differences (RelDiff) as a measure of return has several advantages:
RelDiff is over a wide range nearly identical to Ratio and LogRatio. RelDiff is easy to
understand: it is the price difference compared to the average price. RelDiff has
fewer numerical problems than the other measures with regard to ruin, penny stocks
and exorbitant gains. RelDiff has a confined and symmetrical range for both gains
and losses (-200% to 200%). This makes a symmetrical model for gains and losses
possible. For clustering and measuring the performance of small portfolios the outlier
problems are alleviated. An integrated simple model of returns with a mixture
of three components was possible. Figure 5 shows a Q/Q plot of this model for all
data.

The huge amount of data used for Fig. 5 allows the statement that returns can be
modeled precisely with a Log-Gauss-Log mixture of distributions on RelDiff. This
model can be interpreted as follows: there is one mode of the stock market that
produces random fluctuations in stock prices. These random fluctuations are Gaussian
distributed with zero mean. Gains and losses are produced by the market through
processes different from the central random walk. The magnitude of gains and losses
can be described appropriately by a LogNormal distribution of relative differences.




Fig. 5 Q/Q plot of returns measured in RelDiff compared to a Log-Gauss-Log model distribution 
This model can be used to obtain a more precise model of portfolio and market
risks. The transformation of RelDiff, using the distribution
model, into posterior probabilities for membership in the classes “Gain”, “Loss” and
“Random Fluctuation” offers the chance of a better characterization and prediction
of short time periods and small portfolios. Compare the PUL method for DNA
microarray data analysis in this proceedings volume (Ultsch, Pallasch, Bergmann,
& Christiansen, this volume).
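
How posterior class probabilities follow from such a mixture can be sketched with Bayes' rule. The fitted parameters of the Log-Gauss-Log model are not given in the text, so all weights and distribution parameters in the Python snippet below are hypothetical placeholders, and the symmetric treatment of gains and losses is an assumption made for brevity.

```python
import math


def gauss_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def lognormal_pdf(x: float, mu: float, sigma: float) -> float:
    if x <= 0.0:
        return 0.0  # a LogNormal density has no mass at or below zero
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


# Hypothetical parameters (RelDiff measured in percent); NOT fitted values.
WEIGHTS = {"Loss": 0.1, "Random Fluctuation": 0.8, "Gain": 0.1}
GAUSS_SIGMA = 2.0          # spread of the central random fluctuations
LOG_MU = math.log(8.0)     # location of the LogNormal components
LOG_SIGMA = 0.6            # shape of the LogNormal components


def posteriors(x: float) -> dict:
    """Posterior class probabilities of a RelDiff value x via Bayes' rule."""
    weighted = {
        # Losses: LogNormal on the magnitude of negative RelDiff values.
        "Loss": WEIGHTS["Loss"] * lognormal_pdf(-x, LOG_MU, LOG_SIGMA),
        "Random Fluctuation": WEIGHTS["Random Fluctuation"]
        * gauss_pdf(x, 0.0, GAUSS_SIGMA),
        # Gains: LogNormal on positive RelDiff values.
        "Gain": WEIGHTS["Gain"] * lognormal_pdf(x, LOG_MU, LOG_SIGMA),
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


for value in (0.5, 6.0, -15.0):
    print(value, {k: round(v, 3) for k, v in posteriors(value).items()})
```

Small RelDiff values are attributed to random fluctuation, whereas large positive or negative values are attributed almost entirely to the gain or loss component.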



7 Summary 

The daily rate of return of a stock is an important figure, not only for practitioners
who want to see how their portfolio performs, but also in many theories of
market risk. A model of the distribution of daily stock returns is a prerequisite for
many theories. For example, the Black and Scholes formula for options (Black &
Scholes, 1973) relies on the assumption that daily returns are log-normally distributed.
Designing Products Using Quality Function
Deployment and Conjoint Analysis: 

A Comparison in a Market for Elderly People 



Samah Abu-Assab and Daniel Baier 



Abstract In this paper, we compare two product design approaches, quality function
deployment (QFD) and conjoint analysis (CA), using the example of mobile
phones for elderly people as a target group. Then, we compare our results with
the results of earlier similar comparisons, e.g., Pullman et al. (J Prod Innov
Manage 19(5):354-364, 2002) and Katz (J Innov Manage 21:61-63, 2004). In this
work, the same procedures and conditions are applied as in the paper by
Pullman et al. They viewed the relation between the two methods, QFD and CA,
as a complementary one in which both should be implemented simultaneously,
since each provides feedback to the other. They concluded that CA
is more efficient in reflecting the end-users’ present preferences for the product
attributes, whereas QFD is definitely better in satisfying end-users’ needs from the
developers’ point of view. Katz, in his response from a practitioner’s point of view,
agreed with Pullman et al. However, he concluded that the two methods are better
used sequentially and that QFD should precede conjoint analysis. We test these
results in a market for elderly people.

Keywords Conjoint analysis • Product design • Quality function deployment. 



1 Introduction 

Conjoint analysis (CA) is widely accepted by marketing researchers as a tool to 
measure consumer preferences, whereas quality function deployment (QFD) is 
often used to translate the customer requirements into appropriate technical require- 
ments for the various stages of product development, e.g., Sullivan (1986), Terninko 
(1997). Since the 1980s, the traditional CA and QFD approaches have been pro- 
posed, disseminated and improved. However, as both traditional approaches are 
burdened with a number of practical and theoretical problems, many
researchers and practitioners have recently focused on finding more efficient methods to
overcome these obstacles and improve the applicability and the results of those
methods, e.g., Baier (1998); Pullman, Moore, and Wardell (2002); Katz (2004);
Baier and Brusch (2005); Kazemzadeh, Behzadian, Aghdasi, and Albadvi (2009).

For example, Kazemzadeh et al. (2009) pinpointed many of the problems that the
traditional QFD approach is fraught with: the difficulty of differentiating between
diverse and conflicting customer demands and needs, the difficulty of prioritizing
customer needs and engineering requirements with the rating used, the imprecise
way in which the customer needs are translated and correlated among technical
requirements as well as the relation between customer requirements and technical
requirements, and - finally - the necessity to trade off among the various customer
needs. The authors used CA within QFD to measure the importance of different
customer needs and applied cluster analysis for benefit segmentation. Baier and Brusch
(2005) demonstrated a new approach, “conjoint QFD”, that integrated CA into QFD.
The correlation between the customers’ needs and the engineering characteristics
as well as the importance of the customer needs are estimated using CA. They then
compared the predictive validity of the new approach with that of the traditional
approach. Their results indicated that the validity of the new approach surpasses
that of traditional QFD. Pullman
et al. (2002) compared QFD and CA by applying each to the example of a new
all-purpose harness for the beginning/intermediate ability climber, which was added to
an existing harness product line of a leading manufacturer. In their conclusion, they
stressed that the two approaches are not competitive but rather complementary
and should be implemented simultaneously, with each supplying feedback to
the other. In his reply, Katz (2004) concluded from a practitioner’s point of view that
the two approaches should be considered supplementary rather than complementary
and that QFD should be implemented first in the early product development
stages and then be followed by CA. However, further comparisons are needed.

Consequently, in this paper we compare QFD and CA for product design in a
way similar to Pullman et al. The application is the mobile phone market
for the elderly group 50 plus (50 years old or older). The paper is structured as
follows: In the next section, Pullman et al.’s experiment is described including their
results. Then, our own comparison is described. Finally, the results from Pullman
et al. and from our comparison are summarized. The paper closes with conclusions and
an outlook.



2 Product Design in a Climbing Harness Market 

A quick review of Pullman et al.’s experiment (description, settings and results)
is presented in this section. The design object was an all-purpose climbing harness
for the beginning/intermediate ability climber. Figure 1 (from Pullman et al., 2002)
shows its key features.
For applying CA, an expert team was first formed that collected a large number of
features for climbing harnesses from numerous sources (e.g., self-inspection, catalogues,
and discussions with others). After a managerial judgement the initial list was
reduced to nine attributes with two to five levels each for
the conjoint experiment. Then, a questionnaire was developed in which these different
harness features were described, including 20 harness conjoint profiles and
two harness choice sets. Finally, conjoint data were collected from 105 respondents
and analyzed using Hierarchical Bayes logistic regression.
Figure 1 shows the resulting average utility weights for the nine selected
attributes. It can be seen that - on average - respondents had higher
utilities for brand B, stuffed webbing harnesses, wide waist belts, threaded buckles,
a belay loop, four gear loops, a dedicated tie-in loop, adjustable leg loops, and
the lowest price.




Feature                 Levels
Brand                   A (-0.265); B (0.132); C (0.065); D (0.069)
Harness construction    Webbing/Fleece (-0.040); Stuffed webbing (0.071);
                        Laminate foam (-0.053); Thermo-formed (0.023)
Waist belt width        Narrow (-0.114); Wide (0.114)
Buckle                  Threaded (0.401); Non-threaded (-0.401)
Belay loop              Yes (0.440); No (-0.440)
Gear loops              Two (-0.670); Four (0.670)
Dedicated tie-in loop   Yes (0.241); No (-0.241)
Leg loops               Fixed (-0.414); Adjustable (0.414)
Price                   $39 (0.500); $50 (0.167); $61 (-0.167);
                        $72 (-0.500); $83 (-0.833)

Fig. 1 All-purpose climbing harness with average utilities for attribute-levels
2.2 Application of Quality Function Deployment

For constructing the so-called house of quality, customer needs (CNs) first had to
be collected. Seventeen one-to-one interviews were conducted in a climbing gym.
Then, a second group of climbers was asked to rate the importance of these CNs
on a six-point rating scale. A third group of 30 respondents assessed the competitive
brands A, B, C, and D. They were asked to rate how well each of them met the
different CNs, again on a six-point rating scale, and to state their buying intention
on an additional ten-point scale. Secondly, the engineering characteristics (ECs) and
target values for them were defined in workshops with the expert team described above.
The team identified one or more measurable ECs for each CN and estimated the
correlation between each EC and each CN on a -5 to +5 rating scale. Finally, the
part-deployment matrix was constructed with ECs as rows and the so-called design
features (DFs) as columns. The ECs were taken from the house of quality, whereas
the DFs were specified by the expert team. The correlations between ECs and DFs
were determined using a -3 to +3 scale. Table 1 from Pullman et al. (2002) shows the
results. It can be seen that the target harness should have a soft inside fabric, web
fleece construction, a narrow waist belt, an adjustable wide-range belt, a non-threaded
buckle, a belay loop, four gear loops, no dedicated tie-in loop, and adjustable leg loops,
should have the lowest price, and should come in five different sizes.



3 Product Design in a Mobile Phone Market 

In this section, a thorough description of the experiment that the authors con- 
ducted will be presented and results will be demonstrated. Eventually, a comparison 
between the two experiments will be conducted. 



3.1 Application of Conjoint Analysis 

The expert team identified a number of key attributes of a mobile phone for the
group 50 plus, using various sources such as articles, discussions, and surveys.
The list was then reduced to nine key attributes with three levels
each (see Table 2). An adaptive conjoint analysis experiment was developed
using Sawtooth Software (2002). One-to-one pre-test interviews were used to make
sure that the questionnaire was comprehensible and complete. In the first section of the
questionnaire, respondents rated their preference for the levels on a seven-point rating
scale. Next, the attributes’ relative importance was determined. This information is
useful for discarding the relatively unimportant attributes from further evaluation; in
addition, it supplies information on which the initial estimates of the respondents’
utilities can be based (Sawtooth Software, 2002).

Subsequently, the paired comparison trade-off questions followed, in which conjoint
trade-offs are collected. Fourteen paired questions were selected: seven
with two attributes and seven with three. Finally, the calibration



Table 1 Part-deployment matrix for all-purpose climbing harnesses 
Table 2 Attributes and levels for mobile phones in the CA experiment

Feature                        Levels
Form (8.62)                    Folding (9.29); Sliding (-7.24); Standard (-2.04)
Size (10.08)                   Big (-11.61); Medium (25.66); Small (-14.05)
Display (9.65)                 Big normal (6.77); Medium normal (14.36); Small sens. (-21.13)
Battery time (12.36)           3 days (-49.75); 7 days (5.03); 10 days (44.72)
Mobile phone price (13.80)     30 euro with contract (-7.15); 80 euro without contract (37.22);
                               150 euro without contract (-30.07)
Running costs (13.83)          25 euro/month 5 free numbers (-31.66); Prepaid card 15 euro (35.25);
                               10 euro/month 9ct/min to 5 numbers (-3.59)
Intelligent functions (10.48)  Emergency call with pos. localization (19.65);
                               Program. emergency number (-8.65); Defined emergency number (-10.99)
Keyboard (11.73)               Big (12.74); Medium (27.40); Small (-40.14)
Additional functions (9.44)    SMS, voice output, voice command (28.06); Voice output (-10.49);
                               Voice command (-17.56)

concepts section followed. Although this section is optional, it is very important for
scaling the utilities from rating scales to buying intentions. Fifty-four completed
questionnaires from elderly people could be used for evaluation. Table 2 shows
the results: On average, respondents have the highest utilities (i.e., largest value in
each row) for a folding mobile phone, medium size, a medium normal display, 10 days of
battery standby time (the longest standby time), 80 euro without contract (lowest price
when including the running cost), a prepaid card for 15 euro, emergency call with position
localization, a medium keyboard, and SMS, voice output and voice command.
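
Reading the preferred profile off Table 2 amounts to choosing the level with the largest average utility for each attribute. The Python sketch below does this with the values from Table 2; summing the part-worths to a total utility of a profile is an assumption of this sketch (a simple additive model), and the variable and function names are illustrative.

```python
# Average utilities (part-worths) of the attribute levels from Table 2.
PART_WORTHS = {
    "Form": {"Folding": 9.29, "Sliding": -7.24, "Standard": -2.04},
    "Size": {"Big": -11.61, "Medium": 25.66, "Small": -14.05},
    "Display": {"Big normal": 6.77, "Medium normal": 14.36, "Small sens.": -21.13},
    "Battery time": {"3 days": -49.75, "7 days": 5.03, "10 days": 44.72},
    "Mobile phone price": {
        "30 euro with contract": -7.15,
        "80 euro without contract": 37.22,
        "150 euro without contract": -30.07,
    },
    "Running costs": {
        "25 euro/month 5 free numbers": -31.66,
        "Prepaid card 15 euro": 35.25,
        "10 euro/month 9ct/min to 5 numbers": -3.59,
    },
    "Intelligent functions": {
        "Emergency call with pos. localization": 19.65,
        "Program. emergency number": -8.65,
        "Defined emergency number": -10.99,
    },
    "Keyboard": {"Big": 12.74, "Medium": 27.40, "Small": -40.14},
    "Additional functions": {
        "SMS, voice output, voice command": 28.06,
        "Voice output": -10.49,
        "Voice command": -17.56,
    },
}


def total_utility(profile: dict) -> float:
    """Sum of part-worths of a profile (additive model, an assumption here)."""
    return sum(PART_WORTHS[attr][level] for attr, level in profile.items())


# Preferred level per attribute: the one with the largest average utility.
best_profile = {
    attr: max(levels, key=levels.get) for attr, levels in PART_WORTHS.items()
}

for attr, level in best_profile.items():
    print(f"{attr}: {level}")
print(round(total_utility(best_profile), 2))
```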



3.2 Application of Quality Function Deployment 

For constructing the so-called house of quality, customer needs (CNs) again had to
be collected. They were generated from seventeen one-to-one interviews (which,
according to Griffin & Hauser, 1993, is a sufficient number to draw out the majority
of relevant product needs). The respondents were randomly selected people over 50
in the Cottbus area. During the interviews, respondents were asked to talk about their
relation to their mobile phones. For example, they were asked how frequently they
use them, whether they send SMS or not, what they feel the advantages of some
mobile phones are, and how much they paid for their mobile phones. Three
expert team members (the expert team was the same for the application of CA
and QFD) independently read and analysed the interview transcripts and grouped
the statements into CNs from the point of view of the respondents. Six primary CNs
with one to three secondary CNs each were deduced. Then, 30 respondents rated the
importance of the secondary and the primary CNs on a six-point rating scale. After
that, the secondary CNs were rescaled so that the importances of the secondary CNs
belonging to a primary CN sum to the importance of that primary CN (Table 3).
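
The rescaling step can be sketched as a proportional normalization. The exact procedure is not spelled out in the text, so the Python snippet below assumes that each secondary rating is scaled by the same factor within its primary CN; the CN names and rating values are hypothetical.

```python
def rescale_secondary(primary_importance: float, secondary_ratings: dict) -> dict:
    """Scale secondary CN ratings so that they sum to the primary importance.

    Proportional scaling is an assumption of this sketch.
    """
    total = sum(secondary_ratings.values())
    return {
        name: primary_importance * rating / total
        for name, rating in secondary_ratings.items()
    }


# Hypothetical primary CN with three secondary CNs (six-point ratings).
primary = 5.0
secondary = {"easy to call": 5.5, "easy to read": 5.0, "easy menu": 4.5}

rescaled = rescale_secondary(primary, secondary)
print(rescaled)
print(sum(rescaled.values()))
```

The relative order of the secondary CNs within a primary CN is preserved, while their sum now reflects the weight of the primary CN.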
Next, three competitive mobile phones were rated from the perspective of the
group 50 plus. The three products, Nokia 6300 (standard form), Nokia E65 (sliding
form), and Motorola V8 (folding form), were selected as good examples of the three
forms of mobile phones that can be used by this target group. Thirty respondents rated
these products on a six-point rating scale, evaluating all phones on the same secondary
need before moving to the next. Finally, they rated the likelihood of purchasing
each phone on a ten-point buying intention scale. The results of the comparison
showed that the Nokia 6300 lies comfortably in the hand and has a very long battery
duration. However, it was perceived to be the least preferred mobile phone.
The Nokia E65 is considered “cheap” and “comfortable in the hand”, with the
longest battery duration in this experiment. Yet it is considered to be the least “easy
to use”, least “easy to call”, and least “easy to read”, whereas the Motorola V8 was
clearly perceived to be the most “easy to use”, most “easy to call”, most easy “to
hang up a call”, and to have the most “easy to read keyboard”. Therefore, the Motorola
V8 is rated best in the survey by the group 50 plus.

Then, as in Pullman et al.’s experiment, the expert team identified one to three 
ECs (engineering characteristics) for each CN and assessed the correlation between 
the CNs and the ECs on a -5 to +5 scale, as well as the desired direction of change 
for each EC. The relationships between pairs of ECs (i.e., the roof of the house of 
quality) showed only a small number of interdependencies and are therefore not 
shown. The impact of each EC on preferences was calculated by multiplying each EC 
correlation value by the corresponding CN’s importance and summing over all CNs. 
The results show that the features with the greatest impact on customer preferences 
are: battery capacity (27.4), power consumption (26.1), cost (22.9), display 
brightness (18.0), robustness of the sending signal (17.4) and number of menu layers 
(15.27). The analysis indicates that the preferred mobile phone for the target group 
50 plus should have a long battery life, good display brightness with no reflection, 
minimum cost, a robust sending signal and as few menu layers as possible, to make it 
easier and more comprehensible to use. 
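
The impact calculation described above (correlation value times CN importance, summed over all CNs) can be sketched as follows. This is a minimal illustration only: the importance weights and the correlation matrix are hypothetical placeholders, not the values from the study, and the function name `ec_impact` is an assumption.

```python
import numpy as np

# Hypothetical importance weights for three customer needs (CNs).
cn_importance = np.array([4.0, 2.5, 1.5])

# Hypothetical CN-by-EC correlation matrix on the -5..+5 scale
# (rows: CNs, columns: ECs); 0 means "no relationship".
correlation = np.array([
    [5.0, 0.0, -2.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 4.0],
])


def ec_impact(importance, corr):
    """Impact of each EC: sum over all CNs of importance * correlation."""
    return importance @ corr


impact = ec_impact(cn_importance, correlation)

# Rank ECs by the absolute size of their impact, largest first.
ranking = np.argsort(-np.abs(impact))
```

Ranking by absolute value mirrors the way the text reports impacts as magnitudes; whether the sign should be kept depends on how the direction of change is coded.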

Finally, the parts deployment matrix was constructed (see Table 4). Using the ECs 
again as rows, the design features (DFs) now form the columns of the matrix. 
These features, which were related to each EC and to the eight features from CA, 
were specified and rated on a -3 to +3 scale by the expert team. The correlations 
between ECs and DFs were evaluated in the same way as those between the CNs and ECs 
in constructing the house of quality. The features with the highest impact on 
meeting customer requirements are: mobile phone without SMS/MMS (92.04), medium 
sensitive display (92.02), 10-day (long) battery standby (90.46), high robustness of 
the sending signal (87.71), not very bright (dark) display (81.59), price of 80 euros 
without contract (66.10), few menu layers (46.84), medium keyboard (32.89), defined 
emergency number (27.44) and small volume (1.29). The parts deployment analysis 
shows that the target mobile phone should have no SMS function, a medium sensitive 
display, a long battery standby, a highly robust sending signal, a not very bright 
display, low cost (with low running costs), simple menu layers, a medium keyboard 
with a defined emergency number and a small size. 

Table 4 Mobile phones for elderly people: parts deployment matrix (row and column 
importance values; the individual matrix entries and target values are illegible) 

Engineering characteristic (row)      Importance 
Display brightness                    -17.95 
Keyboard size                          11.67 
SMS/MMS                               [illegible] 
Robustness of sending signal          -17.35 
Emergency key                          14.93 
Battery standby                        26.71 
Cost                                   22.85 
Menu layer                             15.27 
Display size                          [illegible] 
Volume (height*width*depth)            10.06 

Design feature (column) with level coding                                Importance 
Volume (small=1, medium=3, big=2)                                           1.29 
Display size (medium, normal=3; medium, sensitive=1; big, normal=2)       -92.02 
Menu layer (few layers=3, many layers=1)                                   46.84 
Cost (30€ with contract=2, 80€=3, 150€=1)                                  66.10 
Battery standby (3 days=1, 7 days=2, 10 days=3)                            90.46 
Emergency call (with position localization=3, defined emergency number=1)  27.44 
Robustness of sending signal (high=1, low=2)                              -87.71 
SMS/MMS (yes=3, no=1)                                                     -92.04 
Keyboard size (small=1, medium=3, big=2)                                   32.89 
Display brightness (relatively bright=1, relatively dark=2)               -81.59 



3.3 Comparing the CA and QFD Results 

In both approaches the optimal mobile phone had the longest battery standby 
duration (10 days), an economical price (80 euros in this case) and a medium 
keyboard. QFD and CA differed somewhat, however, with respect to the optimal levels 
of display, intelligent functions (i.e., emergency), additional functions (i.e., 
SMS) and mobile size. Cost (price) ranked first in importance in CA, whereas in QFD 
cost (price) was not considered so important; yet both approaches yielded the same 
economical price. In the CA approach, the attribute “additional functions” was not 
considered so important, but “SMS” was the most important level. In contrast, 
QFD’s most significant design feature was the absence of the SMS function, which 
could be explained by the strength of the negative interdependency of the SMS 
function with many other important ECs. The attribute “mobile’s size” yielded a big 
difference between the two approaches. With QFD, it was estimated to be the least 
important design feature, whereas with CA it was weighted as rather significant. 
These divergent results may have occurred because of the different basic conception 
behind each approach: QFD optimizes DFs for production based mainly on the 
perception of the expert team, whereas CA optimizes the attributes based on the 
perception of customers’ needs. These customer perceptions sometimes conflict with 
design features (e.g., customers want the mobile to be relatively small but with a 
relatively large keyboard and/or a large display), thus creating a big challenge for 
the design and production team. 



3.4 Comparing the Results with Pullman et al.’s Experiment 

One can agree with Pullman et al. that QFD and CA optimize results according to 
their own criteria, and the size of the differences (see Table 5) implies that the 
two methods are optimizing rather different functions. 

The main focus of the expert team was to estimate the most important features. 
Therefore, we compare here the relative importance of the common features of the 
mobile phones for the target group 50 plus (see Table 5), and we also show the 
result of the comparison that Pullman et al. conducted. In both experiments, QFD 
importance was measured by the feature’s contribution to overall need satisfaction. 
For the conjoint analysis, attribute importance was calculated only in the 
traditional way, in which each attribute’s importance is the average difference in 
utility between its most and least preferred levels. Although Pullman et al. 
measured CA importance in three ways, in this paper it was measured and compared 
only in the traditional way, because of a lack of information when running the 
experiment. 

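The traditional importance measure described above can be sketched as the range of an attribute's part-worth utilities. The part-worths below are hypothetical, the sketch handles a single (e.g., already averaged) set of part-worths rather than averaging over respondents, and rescaling so that the top attribute equals 100 is an assumption suggested by the layout of Table 5, not something the text states.

```python
def attribute_importance(part_worths):
    """Range of part-worth utilities (best level minus worst level) per attribute."""
    return {
        attribute: max(levels.values()) - min(levels.values())
        for attribute, levels in part_worths.items()
    }


def rescale_to_top_100(importance):
    """Rescale importances so that the most important attribute equals 100 (assumed convention)."""
    top = max(importance.values())
    return {attribute: 100.0 * value / top for attribute, value in importance.items()}


# Hypothetical part-worth utilities, for illustration only.
part_worths = {
    "price": {"80 EUR": 0.9, "150 EUR": -0.9},
    "keyboard": {"small": -0.3, "medium": 0.4, "big": -0.1},
}

importance = attribute_importance(part_worths)
scaled = rescale_to_top_100(importance)
```
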
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Table 5 Comparison of design feature importance in Pullman et al.’s experiment (left table) and 
the mobile phone experiment (right table) 



Design feature 


QFD 


CA 


Design feature 


QFD 


CA 


Harness construction 


34.6 


9.3 


Volume 


1.40 


42.03 


Price ($50-$72) 


43 
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Number of gear loops 
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100 
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100 


Belay loop 
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Type of buckle 


100 
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Emergency call 


7.89 


32.43 


Dedicated tie-in loop 


0.3 


36 


Keyboard 


35.73 


71.49 


Type of leg loops 


3 


61.8 


SMS 


100 


48.27 


Waist belt width 33.2 


17 











The correlations between the CA and QFD importance weights from Table 5 are 
0.319 for Pullman et al.’s experiment and 0.390 for the mobile phone experiment. 
Both values indicate a fairly weak correlation between the CA and QFD importance 
measures, with a slightly better correlation in the new mobile phone experiment. 
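
A correlation of this kind can be computed as a Pearson coefficient between the two importance vectors. The sketch below assumes that the Pearson coefficient is the measure used (the text does not name it) and uses hypothetical importance values, so it does not reproduce the figures reported above.

```python
import numpy as np

# Hypothetical importance weights of four common design features.
qfd_importance = np.array([10.0, 40.0, 100.0, 25.0])
ca_importance = np.array([60.0, 30.0, 55.0, 100.0])

# Pearson correlation between the two importance vectors.
r = np.corrcoef(qfd_importance, ca_importance)[0, 1]
```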



4 Conclusions and Outlook 

In their conclusion, Pullman et al. cautioned against generalizing their results 
from a single study and recommended further research. This paper is therefore a 
contribution to this research field. Obviously, the results of the two approaches 
share some common recommendations and differ in some aspects. Yet these deviations 
between the two approaches can be explained logically, since CA reflects the 
customers’ wishes and desires whereas QFD represents the engineering/management 
view of what the customer needs; in addition, the two products studied also differ 
in their complexity. 

Translating attributes from a conjoint study into design features via engineering 
characteristics fosters innovative solutions and new deployment of the design 
process (Pullman et al., 2002). In the end, it is clear that implementing the two 
methods together is recommended to obtain more accurate and thorough information. 
In addition, integrating the two approaches makes use of the advantages of each and 
compensates for many of their weaknesses. 

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Analyzing the Stability of Price Response 
Functions: Measuring the Influence of Different 
Parameters in a Monte Carlo Comparison 
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Abstract The usage and the estimation of price response function is very impor- 
tant for strategic marketing decisions. Typically price response functions with an 
empirical basis are used. However, such price response functions are subject to 
many disturbing influences, e.g., the assumed profit-maximizing price and the 
assumed corresponding sales quantity. The question of how stable the estimated price 
response function is in such cases has not yet been answered sufficiently. This 
paper investigates how many (and what kind of) errors in market research are 
tolerable for a stable price response function. For the comparisons, a factorial 
design with synthetically generated and disturbed data is used. 

Keywords Monadic approach • Monte Carlo comparison • Price response functions. 



1 Introduction 

The use, and therefore the estimation, of price response functions is very important 
within strategic marketing. Typically, price response functions with an empirical 
basis are used (see, e.g., Balderjahn, 1998; Steiner, Brezger, & Belitz, 2007). Such 
price response functions are subject to many disturbing influences, e.g., the 
assumed profit-maximizing price or the assumed corresponding sales quantity. A 
major open problem is how stable an estimated price response function is. 

This contribution focuses on the question of how many (and what kind of) errors in 
market research are tolerable for a stable price response function. An investigation 
of innovative technologies and systems of house power engineering (see, e.g., 
Brusch, Zühlsdorff, Baier, & Kessler, 2003) was the starting point for our 
contribution. First, the fundamentals of price response functions are briefly 
described (Sect. 2). The analysis itself (Sect. 3) is based on a Monte Carlo 
comparison with a factorial design and synthetically generated and disturbed data. 
A discussion closes this contribution (Sect. 4). 



2 Price Response Functions in Marketing 
2.1 Alternatives of Price Response Functions 

Typically, four alternatives of price response functions are possible in marketing 
practice (see Fig. 1 and, e.g., Diller, 2001; Simon, 1992): 

• A: linear model 

• B: multiplicative model 

• C: double-bent model (Gutenberg function) 

• D: logistic model 

The linear model is a simple standard model. In most cases it can be stated that 
within an observed range all models show a quasi-linear trend. Significant 
differences in internal validity could not be identified. However, the use of the 
model should be limited to cases where the price range is similar to the ranges 
already investigated. 

The multiplicative model is regarded as less robust. In particular, the absence of a 
maximum price must be criticized: it implies a large leeway for price increases, 
which is realistic for very few goods. 

The model based on Gutenberg is often called the double-bent linear model. It is 
expected to describe reality best (see, e.g., Simon, 1992). For detailed analysis 
the curve can be divided into separate linear sections. For example, the section 
in the middle shows that a company can vary the price without losing customers to 
or winning customers from a competitor. 

The logistic model shares the critical aspects of the multiplicative model. In the 
case of very low or very high prices the probability of a misjudgment is 
considerable, because the curve of the price response function is very flat there. 
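
The four alternatives can be sketched as functions of sales quantity x in dependence on price p. The parameterizations below are common textbook forms and are assumptions; the text itself gives no formulas, and all parameter names (`a`, `b`, `c1`, `c2`, `p_bar`, `x_max`, `k`, `p_mid`) are hypothetical.

```python
import numpy as np


def linear(p, a, b):
    """A: linear model, x = a - b*p; sales reach zero at the maximum price a/b."""
    return a - b * p


def multiplicative(p, a, b):
    """B: multiplicative model, x = a * p**(-b) with b > 0; never reaches zero."""
    return a * p ** (-b)


def gutenberg(p, a, b, c1, c2, p_bar):
    """C: double-bent (Gutenberg) model with a sinh term around a reference price p_bar."""
    return a - b * p + c1 * np.sinh(c2 * (p_bar - p))


def logistic(p, x_max, k, p_mid):
    """D: logistic model; flat for very low and very high prices."""
    return x_max / (1.0 + np.exp(k * (p - p_mid)))


# Evaluate all four curves on a hypothetical price grid.
prices = np.linspace(1.0, 10.0, 10)
curves = {
    "A": linear(prices, 100.0, 10.0),
    "B": multiplicative(prices, 100.0, 2.0),
    "C": gutenberg(prices, 100.0, 2.0, 5.0, 0.8, 5.0),
    "D": logistic(prices, 100.0, 1.0, 5.0),
}
```

Note that these forms express quantity as a function of price, whereas the linear model used later in the text is written in the inverse form p = mx + n.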




Fig. 1 Alternatives of price response functions 






In early empirical studies in 1982 and 1985, Simon and Kucher were not able 
to establish the superiority of any one model. They concluded that the Gutenberg 
model appears most valid in an economic sense, but that it is very time-consuming 
to estimate (see Simon & Kucher, 1988). 

For the later analysis, only the linear model in the form p = mx + n (where p 
denotes the price and x denotes the sales quantity) will be used, on the one hand 
on account of its simplicity and on the other hand on account of the empirically 
good results of linear models. Furthermore, discussions in the marketing literature 
with respect to (w.r.t.) the connected values (e.g., sales maximum price) are mostly 
based on linear price response functions. 



2.2 Price Response Functions and Connected Values 

If the price response function is known, the connected values can be calculated. 
From the maximum price ($\hat{p}$) and the maximum quantity ($\hat{x}$), the two 
values most relevant for management can be derived. These are typically: 

• The price (and corresponding demand) for the sales maximum point with 

$p_S^* = \frac{\hat{p}}{2}$. 

• The price (and corresponding demand) for the profit maximum point with 

$p_P^* = \frac{\hat{p} + c_v}{2}$ (where $c_v$ denotes the variable costs). 

This information allows the marketer to carry out detailed (price) analyses. 
For example, it allows marketers to consider both costs and demand in calculating 
a price that maximizes profits (see $p_P^*$ in Fig. 2). 
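
For the linear form p = mx + n used in this paper, the connected values can be worked out directly. The sketch below assumes m < 0, reads the "sales maximum" as the revenue maximum (which is what the formula with half the maximum price implies), and uses hypothetical parameter values; the function name `connected_values` is an assumption.

```python
def connected_values(m, n, c_v):
    """Connected values of the linear price response function p = m*x + n (m < 0)."""
    p_max = n                      # maximum price: price at which x = 0
    x_max = -n / m                 # maximum quantity: quantity at which p = 0
    p_sales = p_max / 2            # price at the sales (revenue) maximum
    p_profit = (p_max + c_v) / 2   # price at the profit maximum
    return {
        "p_max": p_max,
        "x_max": x_max,
        "p_sales": p_sales,
        "x_sales": (p_sales - n) / m,
        "p_profit": p_profit,
        "x_profit": (p_profit - n) / m,
    }


# Hypothetical example: p = -0.5*x + 100 with variable costs of 20 per unit.
values = connected_values(m=-0.5, n=100.0, c_v=20.0)
revenue_at_sales_max = values["p_sales"] * values["x_sales"]
profit_at_profit_max = (values["p_profit"] - 20.0) * values["x_profit"]
```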



2.3 Instruments for Measuring Price Sensitivity 

To estimate the price response function, individual price information is required. 
For measuring this price information and/or price sensitivity, three different 
groups of instruments exist (see, e.g., Sattler & Nitschke, 2003; Skiera, 1999): 

• Revealed preferences 

• Stated preference data 

• Buy offers (bids) 

In the case of revealed preferences, actual purchases are recorded and therefore 
high validity is expected (see, e.g., Ben-Akiva, Bradley, Morikawa, Benjamin, 
Novak, et al., 1994). Common are experimental alternatives (e.g., test-market 
simulation) and non-experimental alternatives (e.g., scanner panel data). 

Fig. 2 Price response functions and connected values 

For stated preference data, the widely known and often used survey of preferences 
is carried out (see, e.g., Ben-Akiva et al., 1994). The first alternative is direct 
questioning, for which three options exist. One option is asking for the maximum 
price which will be paid (see, e.g., Kalish & Nelson, 1991). Another option is 
presenting different prices for the same product and asking for the maximum 
price at which the product will be bought (see, e.g., Gabor & Granger, 1966). A third 
option is asking for the price importance in relation to other product attributes within 
so-called self-explicated models (see, e.g., Srinivasan, 1988). The second alternative 
for getting stated preference data is indirect questioning. Here, the use of conjoint 
analysis, where the price is only one attribute among others, is very common (see, 
e.g., Green & Srinivasan, 1978; Green, Krieger, & Wind, 2001). 

In the case of buy offers, respondents place bids. Here, the practical use differs 
depending on the alternative, for example the very common auctions (see, e.g., 
Hoffman, Menkhaus, Chakravarti, Field, & Whipple, 1993; Skiera & Revenstorff, 
1999; Vickrey, 1961) or lotteries (see, e.g., Becker, DeGroot, & Marschak, 1964; 
Völckner, 2006). 

For the following Monte Carlo comparison, the monadic approach (as an alternative 
of direct questioning to obtain stated preference data) is used. Here, the sample 
is split into groups of equal size. Each group is presented a question about the use of 
the product at one of the price points, i.e., the price is part of the product profile. 
For estimating the parameters of the price response function, a linear regression 
analysis can be carried out (see Fig. 3). 




Fig. 3 The Monadic approach for estimating price response functions 
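
The estimation step of the monadic approach can be sketched as a straight-line fit through the observed buying shares per price group. The price points, group size and buyer counts below are hypothetical, and regressing the share on the price (and then inverting to the form p = mx + n) is one possible reading of the procedure, not necessarily the one used in the study.

```python
import numpy as np

# Hypothetical monadic design: three groups, each sees exactly one price point.
prices = np.array([40.0, 60.0, 80.0])
respondents_per_group = 50
buyers = np.array([31, 19, 11])          # respondents who would buy at that price

# Observed share of buyers per price group.
share = buyers / respondents_per_group

# Least-squares line share = slope * price + intercept.
slope, intercept = np.polyfit(prices, share, 1)

# Inverse form p = m*x + n, with x measured as share of buyers.
m = 1.0 / slope
n = -intercept / slope
```

With more groups the fit uses more price points, and with more respondents per group each share is measured more precisely; these are the two design factors varied later in the Monte Carlo comparison.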



3 A Monte Carlo Comparison 
3.1 Research Design 

In order to answer the question of which factors have which influence on price 
response functions and their connected values (especially the sales and profit 
maximum points), synthetic data were generated and analyzed. For this purpose, 
different theoretical and empirical investigations were examined for relevant 
characteristics. The result is a factorial design with five factors, each with three 
levels. The first three factors concern the price response function (see Sect. 2.2) 
as well as the underlying Monadic Approach for price effect estimation (see 
Sect. 2.3) and vary: 

• The steepness of the price response function (e.g., low, medium or high steepness 
of the curve of the price response function) - factor 1 

• The number of groups used at the Monadic Approach (e.g., few, medium or many 
groups and/or asked prices while generating and estimating the linear regression 
function) - factor 2 

• The number of respondents for each group (e.g., few, medium or many respon- 
dents within each group while generating and estimating the linear regression 
function) - factor 3 

Factors four and five relate to the disturbance of the “estimated” data 
and vary: 

• The random error in measuring the sales estimation (using normal distributions 
with small, medium, and large standard deviations for generating additive error) - 
factor 4 

• The systematic error in measuring the sales estimation (using no, small and 
moderate displacement of the sales estimations) - factor 5 

Based on this factorial design, typical pricing information was generated using 
three necessary types of data: 

• “True” data were generated in accordance with the respective level of the first 
three factors (factors 1, 2 and 3). Subsequently, the “true” sales maximum and, 
more importantly, the sales quantity corresponding to the estimated optimal prices 
under the underlying price response function were calculated. 

• “With market research” data were generated in accordance with the respective 
level of all factors (factors 1 to 5). Linear regression and the following choice 
procedure for simulating the real buying decisions of the respondents were used. 
Here, Prob(p, buy) indicates whether a respondent in a decision task (simulated by 
a uniformly distributed random probability Prob(ex), where ex denotes the estimated 
sales quantity and $\hat{x}$ denotes the maximum sales quantity) decides to buy a 
product for a given price p: 

$$\mathrm{Prob}(p, \mathrm{buy}) = \begin{cases} 1, & \text{if } \mathrm{Prob}(ex) < ex/\hat{x}, \\ 0, & \text{otherwise.} \end{cases}$$ 

To compare the advantages of market research fairly, its negative effect, namely the 
market research costs incurred, was integrated. As cost aspects, variable costs 
$c_v$ of 20 euros for each interview and fixed costs of 10,000 euros were assumed. 
A kind of “profit” of doing research could then be calculated and compared with 
the data that use no market research. 

• “Without market research” data were generated similarly to the “with market 
research” data, but only according to factors 1, 4 and 5. In contrast 
to the “with market research” data, no market research (with groups and respondents 
for each group of the Monadic Approach) is considered, and therefore factors 
2 and 3 are irrelevant. Here, only a small group of five experts was 
assumed (e.g., members of the top management), each of whom had to estimate 
three price points (which represent three groups in the Monadic Approach). 
From these estimates the resulting sales maximum could be computed. 
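
The cost assumption stated above (10,000 euros fixed plus 20 euros per interview) can be made concrete with a short calculation. The sketch assumes that every respondent in every group is interviewed exactly once and uses the factor levels listed in Tables 1 and 2 (2, 4 or 6 groups; 25, 50 or 100 respondents per group); the function name `research_cost` is hypothetical.

```python
from itertools import product

FIXED_COST = 10_000          # euros, fixed market research costs
COST_PER_INTERVIEW = 20      # euros per interview

GROUP_LEVELS = (2, 4, 6)             # number of groups (factor 2)
RESPONDENT_LEVELS = (25, 50, 100)    # respondents per group (factor 3)


def research_cost(groups, respondents_per_group):
    """Total market research cost, assuming one interview per respondent."""
    return FIXED_COST + COST_PER_INTERVIEW * groups * respondents_per_group


# Cost for every combination of the two design factors and the mean over all of them.
costs = [research_cost(g, r) for g, r in product(GROUP_LEVELS, RESPONDENT_LEVELS)]
mean_cost = sum(costs) / len(costs)
```

Under these assumptions the mean cost is about 14,667 euros, which equals the gap between the overall “with market research” means of the sales maximum (289,086) and the “profit” maximum (274,419) reported in Tables 1 and 2.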



3.2 Results 

With a full factorial 3^5 design with hundredfold replication, a total of 24,300 
synthetic datasets were generated and analyzed. For each dataset, the sales maximum 
and the “profit” maximum were calculated. Table 1 shows the results for the sales 
maximum as mean values for both relevant approaches, “with market research” and 
“without market research”; t- and F-tests indicate the significance of the 
differences w.r.t. methods and factors. Table 2 shows the results for the 
“profit” maximum in the same way. 

The results of Table 1 show a clear superiority of the “with market research” 
approach over the “without market research” approach across the variety of factor 
levels. Furthermore, the results point out which other influencing factors (in 
addition to the natural influence of the steepness of the price response function) 
can be focused on. Because the estimated data were evaluated with the real price 
response function, higher sales values are better. For example, it can be seen that 
the Monadic Approach used underestimates the maximum price that is paid (which 
leads to smaller sales maximum values). Nevertheless, these results do not consider 
the market research costs incurred. 

Table 1 Monte Carlo comparison concerning the impact of the influence factors using mean 
values w.r.t. the sales maximum (n = 24,300 datasets) 

                                                        Mean sales maximum 
Factor                           Level             With market   Without market   Overall 
                                                   research      research 
Steepness of the price           0.5               123,909***    111,198          117,554 
response function                1.0               247,737***    223,110          235,423 
                                 2.0               495,611***    446,412          471,011*** 
Number of groups used at         2 groups          285,810***    260,950          273,380 
the Monadic approach             4 groups          290,451***    260,813          275,632“ 
                                 6 groups          290,996***    258,957          274,977 
Number of respondents            25 respondents    286,629***    259,761          273,195 
for each group                   50 respondents    289,733***    261,487          275,610“ 
                                 100 respondents   290,896***    259,471          275,183 
Measurement error in the         σ = 0.1           289,076***    261,039          275,057“ 
sales estimation matrix          σ = 0.2           289,149***    260,539          274,844 
                                 σ = 0.3           289,033***    259,141          274,087 
Systematical error (displace     0%                288,948***    259,828          274,388 
of sales estimations)            5%                289,105***    260,666          274,886“ 
                                 10%               289,204***    260,225          274,714 
Overall                                            289,086***    260,240 

*** Significant differences within rows (t-test) and columns (F-test) at the p < 0.001 level, ** at 
the p < 0.01 level, * at the p < 0.1 level, ns not significant 

Table 2 Monte Carlo comparison concerning the impact of the influence factors using mean 
values w.r.t. the “profit” maximum (n = 24,300 datasets) 

                                                        Mean “profit” maximum 
Factor                           Level             With market   Without market   Overall 
                                                   research      research 
Steepness of the price           0.5               109,243       111,198***       110,220 
response function                1.0               233,070***    223,110          228,090 
                                 2.0               480,944***    446,412          463,678*** 
Number of groups used at         2 groups          273,476***    260,950          267,213 
the Monadic approach             4 groups          275,785***    260,813          268,299“ 
                                 6 groups          273,996***    258,957          266,477 
Number of respondents            25 respondents    274,629***    259,761          267,195 
for each group                   50 respondents    275,733***    261,487          268,610“ 
                                 100 respondents   272,896***    259,471          266,183 
Measurement error in the         σ = 0.1           274,409***    261,039          267,724“ 
sales estimation matrix          σ = 0.2           274,482***    260,539          267,511 
                                 σ = 0.3           274,366***    259,141          266,753 
Systematical error (displace     0%                274,282***    259,828          267,055 
of sales estimations)            5%                274,438***    260,666          267,552“ 
                                 10%               274,537***    260,225          267,381 
Overall                                            274,419***    260,240 

*** Significant differences within rows (t-test) and columns (F-test) at the p < 0.001 level, ** at 
the p < 0.01 level, * at the p < 0.1 level, ns not significant 

The results presented in Table 2 show a superiority of the “with market research” 
approach over the “without market research” approach in most cases and across 
most factor levels. It can be seen that only the steepness of the price response 
function affects which way of estimating the maximum value performs better (for the 
lowest steepness, 0.5, the approach without market research yields the higher 
mean). In all other cases a decision process without market research leads to worse 
results. 

Altogether, the results suggest using some kind of market research, here the Monadic 
Approach for asking real customers (respondents) about their price sensitivity. The 
alternative of doing this “in-house” and without real customers leads to results with 
which the company would make wrong price decisions and would lose money. 



4 Conclusion and Outlook 

This contribution focuses on the estimation of price response functions, which is 
highly relevant in marketing. Since these mostly empirically based price response 
functions are subject to many disturbing influences, we analyze the most important 
ones. The analysis is carried out with synthetically generated and disturbed 
data. 

In our comparison we could show which influence factors have to be considered 
when deciding on prices and which positive impact market research has. Although 
the assumed market research costs were chosen rather arbitrarily and are therefore 
not representative, one nevertheless obtains first information about the advantages 
of market research. 

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Real Options in the Assessment 
of New Products 



Said Esber and Daniel Baier 



Abstract For the evaluation of new product development (NPD) and - alternatively - 
research & development (R&D) projects, the consideration of technical, 
market, and environmental uncertainties is of the highest importance. Such 
uncertainties often result from changes in the markets and their environment. In 
these cases, real options assessment can provide a better understanding of the value 
of a project, since this approach allows management actions during the product’s 
lifetime to be modeled very flexibly and the best project alternatives to be 
selected. This paper describes the use of the real options approach in information 
technology (IT). The application field is the production of a new desktop 
video-conference system with possible product extensions to be developed during its 
lifetime. 

Keywords Marketing research • New product development • Real options. 

1 Introduction 

For the assessment of NPD as well as R&D projects, the net present value (NPV) 
method is a standard capital market-oriented method. Each expected cash inflow or 
outflow of a project is discounted back to its present value; these values are then 
summed, giving the NPV of the project. In simple terms, NPD or R&D projects 
with a positive NPV should be accepted, and those with a negative NPV should be 
rejected. Within the framework of strategic decisions - e.g., takeovers or mergers 
of companies, or NPD and R&D decisions in uncertain environments - this capital 
market-oriented assessment is used even though its massive simplifications in 
evaluation and the inaccuracies of its results are obvious. However, in complex 
assessment situations where flexible management reactions to project developments 
have to be taken into account, the use of this method has often been criticized for 
being too inflexible and therefore problematic (Copeland & Antikarov, 2001). 

Besides the problem of predicting future cash flows, the discount rate is also a key 
variable in the evaluation. Typically, the company’s average cost of capital (after tax) 
is used for this purpose. In order to integrate the uncertainties of the project, the 
expected cash flows are often discounted with a higher, risk-adjusted rate of 
interest. Additionally, scenario analyses can be carried out to determine the project 
value under pessimistic, optimistic, and realistic assumptions (scenarios). The 
evaluation results of the different scenarios are then summarised in an “expected 
value” in which each scenario is weighted with an assumed probability. However, as 
each scenario captures the project value under certain conditions only, this method 
is not flexible enough when the management has more complex reaction possibilities 
in the future which can have a tremendous influence on the project value (Brealey & 
Myers, 2000). For example, in the course of a project the management has a huge 
variety of rights, but no obligations, to make additional project investments (e.g., 
additional new products, extensions), similar to the owners of financial options in 
financial markets, who have the right but not the obligation to sell or buy assets 
for predefined prices. 

This finding was substantiated by Myers in 1977 for the first time when he used 
the term “real options”. Possible future decisions of the management are modeled 
like financial options or by using decision trees (Dixit & Pindyck, 1995). However, 
this method is not very widespread in NPD or R&D practice. Therefore, within the 
scope of this paper, the following research issues are addressed: To what extent is 
the use of the real options approach in NPD or R&D projects possible and reasonable, 
and in which way can the real options approach be used in the assessment of new 
products (here: video-conference systems)? 
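
The NPV method and the scenario-based “expected value” described in this introduction can be sketched as follows. The cash flows, discount rates and scenario probabilities are hypothetical, and the function names `npv` and `expected_npv` are assumptions chosen for illustration.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[t] is the cash flow at the end of period t (t = 0 is today)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


def expected_npv(rate, scenarios):
    """Probability-weighted NPV over scenarios given as (probability, cash_flows) pairs."""
    return sum(prob * npv(rate, flows) for prob, flows in scenarios)


# Hypothetical project: investment of 100 today, returns in the next two periods.
scenarios = [
    (0.25, [-100.0, 40.0, 40.0]),   # pessimistic
    (0.50, [-100.0, 60.0, 60.0]),   # realistic
    (0.25, [-100.0, 80.0, 80.0]),   # optimistic
]

npv_realistic = npv(0.10, scenarios[1][1])
npv_risk_adjusted = npv(0.15, scenarios[1][1])   # higher, risk-adjusted rate
npv_expected = expected_npv(0.10, scenarios)
```

The sketch also shows the limitation noted in the text: each scenario fixes its cash flows in advance, so later management reactions (such as adding an extension only if the first periods go well) are not represented.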



2 Uncertainties in Product Development 

Investments in product development are of prime importance to companies. On 
the one hand, they have a strong influence on the financial strength of the company 
because they involve high expenditures. On the other hand, these investments rank 
highly because they are meant to ensure the competitiveness of the company in the 
future. Numerous technology companies achieve a high proportion of their 
revenues with young products that were recently developed. Therefore, an adequate 
financial assessment of NPD projects must first and foremost include the 
uncertainties that are involved in product development (e.g., amount of investment 
expenditures, behavior of the competitors). 

These uncertainties complicate the planning and forecasting of product success. 
Therefore, uncertainties have to be considered in the assessment of NPD 
projects. Additionally, the assessment should also include the possibilities for 
management action connected with them. In the course of NPD or R&D projects, new 
information can lead to a change of the strategy pursued so far. These two 
characteristics are very relevant because there is often a longer period of time 
between the decision in favour of product development and the first payment received 
from the project. The duration of this period (whether short or long) has a major 
influence on the project value. The term “uncertainty” simply describes the 
possibility of a deviation from an expected condition. This involves a chance in 
the positive case and a danger in the negative one. Strictly speaking, one speaks 
of a risk when information is available about the probability with which different 
environmental conditions might occur (Brandt, 2002). 

In the financial context, the term “risk” describes the probability that the 
return on an investment deviates from the originally expected one. This 
definition includes the risk of loss as well as the chance of gain. From the 
company's perspective, technical and economic risks can be distinguished. 
Technical risks concern the prospect of reaching a certain target within the planned 
period of time with the existing resources from a scientific-technical point of view. 
They include the realisation risk, performance risk, cost uncertainties 
and regulatory risks (e.g., certification risks). The economic risks are related 
to the marketing phase. They include price risks, competition risks, the rise of me-too 
products and the risk of market acceptance of the new product. These economic 
risks can change the planned sales volume and the forecasted customer population 
(Billing, 2002). 

Decision makers have three possible courses of action for reacting to the 
uncertainty problem. The first is to ignore uncertainty, consciously or 
unconsciously: the information is held back or cut out of the decision makers' 
field of view. This happens either out of ignorance of the situation 
or with the aim of making the decision easier. The second is the 
reduction of uncertainty, typically by improving the information basis 
through gathering information at the time when the decision is taken. 
The third is the acceptance of uncertainty. For this purpose, the 
risk structure and the flexibility of the decisions can be analysed. 
The analysis of the risk structure distinguishes between 
the individual risks, namely those that can be reduced and those that cannot. 
The flexibility of the decisions makes it possible to realise future chances 
and to reduce the risks that occur (Adam, 1996; Damodaran, 2001). 



3 Real Options in NPD and R&D Projects 

The term “real options” was coined by Myers in 1977 to denote operative 
options. A real option means future scopes of action and investment 
possibilities of a company, connected with the ability of the management to 
adjust operative decisions to changed environmental conditions (Hommel & Pritsch, 
1999). Real options have five characteristics. The first one is irreversibility. The 
investment decision must be irreversible. Otherwise, the investment outlay 
could be fully recovered at any point in time; the risk would thus be drastically 
reduced and the real options would decrease in value. The second characteristic is 
uncertainty. The profitability of an investment is normally uncertain. Otherwise, 
all future decisions could just as well be taken once and for all at the beginning of the 
decision process. The third characteristic is flexibility. Postponing 
an investment decision must be a genuine alternative to acting. The fourth 
characteristic is exclusivity, which means for financial investments that the acquired right 
can be exercised by the option owner only. In the simple monopoly case, the option is 
exclusive and available to the respective company only. However, there 
is no exclusivity in perfect competition. The fifth characteristic refers to the access 
to information: delays in investment decisions must be accompanied by an 
improvement of the information level about profitability (Brach, 2003). 

In principle, a distinction is made between learning, growth and insurance 
options. Learning options contribute to reaching a higher level of information for the 
company which learns from them and will then be in a position to take investment 
decisions. The learning options include postponement options (waiting options), 
delay options, and stage investment options (Amram & Kulatilaka, 1999; Copeland 
& Antikarov, 2001). The growth option gives the company the chance of keeping 
and improving its competitive position. The growth options include the extension 
options and innovation options (Brealy & Myers, 2000; Kilka, 1995). The insur- 
ance options enable the company to react to unfavorable market developments (e.g., 
reduction of the production volume). Therefore, the insurance options are important 
to the risk management of a company. They include the capacity change options, 
breaking-off options and readjustment options (Mostowfi, 2000; Trigeorgis, 1996). 

To assess whether an investment makes sense, the assessment method should 
satisfy several criteria that distinguish a good decision rule from all others: 
it should guarantee the maximisation of the company value (the accepted 
projects should increase the company value; for this reason, the present value of 
the expected cash flows is to be calculated), represent uncertainties and 
flexibility, consider the irreversibility of investments that have 
already been realised, and be applicable to various projects and investments. 
The real options method fulfils these criteria to their full extent. 



4 Real Options Assessment Using Excel Based Tools 

In R&D management, decision-supporting tools based on (Microsoft) Excel are very 
useful because they offer a better understanding of the problem structure 
and a deep insight into the background of each decision. These tools can be used 
for decision analysis in uncertain environments as well as for the analysis of decision 
trees. In economic and financial fields, Excel is used as a decision instrument. Two 
Excel add-ons (Precision Tree 5.0 and @Risk 4.5 from Palisade in Ithaca, NY) can 
be used to present decision trees, analyse the economic and technical risks 
and make sensitivity results visible (Rese & Baier, 2007). Precision Tree is decision 
support software which helps to choose between alternative business decisions, such as 
decisions on introducing new products, factoring in decisions at each stage 
of marketing and production. Precision Tree offers the required tools for the 
presentation and analysis of decision trees. Decision trees are easy to set up 
in an Excel spreadsheet: first, cells are selected in the spreadsheet 
and different node types (chance nodes or decision nodes) are introduced. 
Afterwards, the probabilities and profits are entered into the cells of the tree. The best 
decision is determined at each node, and the corresponding branch is marked with a 
TRUE label. 

In the following, an application of such tools is described for the case 
of the development of the video-conference system BRAVIS (Brandenburg Video 
Conferencing System). The application was done with real financial values, which 
are modified in this section for confidentiality reasons. Video-conferencing is a 
general form of communication in which people talk to and see each other at the 
same time even though they are not sitting in the same room. Video-conference 
systems are used in various fields, such as tele-teaching, procurement, management, 
R&D, medical consultations, business deliberations, marketing and sales, customer 
service and private entertainment. Using video-conference systems, a company 
is in a position to gain strategic time advantages, to improve the quality of 
discussions and to save various costs (e.g., costs of face-to-face discussions and 
business trips, costs of phone calls). The video-conference system BRAVIS has 
been developed at the Chair for Computer Networks and Communication Systems 
at the Brandenburg University of Technology (BTU) Cottbus. BRAVIS has been 
developed within the framework of a federal state project and supports tele-teaching 
applications, such as tele-seminars, tele-consultations, examinations and special 
lectures (Zühlke, 2004). In 2005, the BRAVIS activities led to a spin-off from BTU 
Cottbus (http://www.bravis.eu). The company has already been successful 
in acquiring venture capital (from the High Tech Gründerfonds and from 
Brandenburg Venture Capital); consequently, financial evaluations of the company and its 
NPD and R&D projects are of the highest importance. 

Figure 1 shows a decision tree presented by Precision Tree. The R&D management 
has to decide on the development of a successor, BRAVIS 2, for the already 
established BRAVIS 1 system. It has to consider this introduction 
in the context of the option to later introduce a third version of the system, 
BRAVIS 3, depending on the market acceptance of the first and second versions 
of the BRAVIS system (with high or low revenues). The real 
options (extension options) analysis provides the expected probabilities for the 
branches at the chance node (high or low revenues after the introduction of 
BRAVIS 2) and the expected profits for all terminal nodes of the decision tree. 
Real options make an additional value available to the management. If the expected 
profits were analysed without considering the possibility of a later introduction of 
BRAVIS 3, BRAVIS 2 would not be introduced. However, if the possible future 
introduction of BRAVIS 3 is taken into account when calculating the expected 
profits, BRAVIS 2 would be introduced. These calculated decisions are marked with 
(TRUE) or (FALSE) in the decision tree. With @Risk, an additional risk analysis 
add-on, a dynamic sensitivity analysis can be conducted. Uncertain events 
are defined using continuous probability distribution functions. Figure 1 also 
shows the calculation of the profits at the terminal nodes of the tree. The 
decision about the development and market introduction of BRAVIS depends on the 
price of BRAVIS (p), on the variable costs (c) (identical prices and costs for all three 
BRAVIS systems are taken as a basis), on the sales volumes of BRAVIS 1, 2 and 3 
(for BRAVIS 1 the sales are assumed to be x, “low turnover”; with the introduction of 
BRAVIS 2 and 3, possible additional sales are modelled as x1 for BRAVIS 2 and 
as x2 for BRAVIS 3), on the R&D costs for the development of a BRAVIS system (C) 
and on the probability P that higher sales will be achieved after the introduction 
of BRAVIS 2 and 3. For a decision in favor of the introduction of BRAVIS 2, 
some estimates can be made. For example, the probability of realising higher 
sales is assumed to be uniformly distributed between 20% 
and 40%. The possible additional sales of BRAVIS 3 are assumed to be distributed 
with a mean of 50,000 systems and a standard deviation of 15,000. Thus, a Monte 
Carlo simulation of the decision tree can be carried out. Here, multiple scenarios 
of a model are calculated by repeatedly sampling values from all known probability 
distributions for the uncertain variables. An @Risk simulation can run through hundreds 
or thousands of trials (or scenarios) per second. During a single trial, @Risk randomly 
selects a value from the defined possibilities (the range and shape of the distribution) 
for each uncertain variable and then recalculates the spreadsheet. Figure 2 shows the 
distribution of the profit as a result of this simulation. Moreover, @Risk offers an 
additional, extended analysis for determining the critical factors: sensitivity 
analysis. Figure 3 shows that this tool can be used to rank the uncertain 
input factors according to their influence on the simulation results. 
Fig. 2 Distribution of the profits according to a Monte Carlo simulation 

Fig. 3 Sensitivity analysis for the BRAVIS 2 NPD project 
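
The interplay of decision tree rollback and Monte Carlo sampling described above can be 
illustrated with a minimal Python sketch. It is not the Precision Tree or @Risk model of 
the paper: the profit formulas of Fig. 1 are not reproduced here, so the tree structure, 
the function names and all numeric values for p, c, x1 and C are hypothetical placeholders. 
Only the uniform distribution of P between 20% and 40% and the mean and standard deviation 
of x2 are taken from the text; the normal shape of the x2 distribution is an assumption. 

```python
import random
import statistics

# Hypothetical placeholder values; the paper's real figures are confidential.
PRICE = 300.0          # p: price per system
UNIT_COST = 100.0      # c: variable cost per system
X1 = 20_000            # x1: additional BRAVIS 2 sales in the high-revenue case
RD_COST = 3_000_000.0  # C: R&D cost per new system version


def value_of_bravis2(prob_high, x2, with_option=True):
    """Roll back an assumed two-stage tree: BRAVIS 2 now, BRAVIS 3 as a later option."""
    margin = PRICE - UNIT_COST
    # Decision node after high revenues: introduce BRAVIS 3 only if it pays off.
    option_value = max(margin * x2 - RD_COST, 0.0) if with_option else 0.0
    # Chance node: high revenues (extra sales x1) versus low revenues (no extra sales).
    return prob_high * (margin * X1 + option_value) - RD_COST


def simulate(trials=10_000, seed=1):
    """Monte Carlo simulation: sample P and x2, then recalculate the tree each trial."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        prob_high = rng.uniform(0.20, 0.40)        # P uniform between 20% and 40%
        x2 = max(rng.gauss(50_000, 15_000), 0.0)   # assumed normal, floored at zero
        results.append(value_of_bravis2(prob_high, x2))
    return results


if __name__ == "__main__":
    print("without option:", value_of_bravis2(0.30, 50_000, with_option=False))
    print("with option:   ", value_of_bravis2(0.30, 50_000, with_option=True))
    profits = simulate()
    print("mean simulated value:", statistics.mean(profits))
    print("share of trials with positive value:",
          sum(v > 0 for v in profits) / len(profits))
```

With these placeholder numbers the expected value of BRAVIS 2 is negative when the later 
introduction of BRAVIS 3 is ignored and positive when it is included, which mirrors the 
qualitative argument of the text; the magnitudes themselves carry no meaning. 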



5 Conclusions and Outlook 

The paper shows that the use of the real options approach in NPD and R&D is pos- 
sible and reasonable. The consideration of the real options method is very useful for 
modeling technical uncertainties and those which have been caused by the market. 
In the course of the project, the knowledge increases so that the managers can make 
better decisions (postponement or break-off of projects) using the acquired informa- 
tion at the point in time when the decision is to be taken. The real options method 
models this flexibility. 
Exploring the Interaction Structure of Weblogs 



Martin Klaus and Ralf Wagner 



Abstract Weblogs (blogs for short) are fundamentally changing our way of 
communication, a phenomenon which has led to the creation of a new type of social 
interaction. One interesting feature of this virtual communication is that it 
refers to other blogs by setting hyperlinks in the course of the dialog. 

In this study, we combine different approaches of crawling blogs and subse- 
quently use social network analysis to uncover the blog structure. We introduce 
quantitative assessments of the revealed structure and highlight the relevance for 
direct marketing communication. 

Keywords Marketing communication • Social network analysis • Web crawler • 
Weblog. 



1 Introduction 

The core of modern marketing concepts (e.g., interactive marketing or dialog 
marketing) is a personalized contact with customers. These activities require up-to-date 
and detailed information at the individual level. Two distinct qualities of 
information are relevant: transactional and non-transactional. Transactional information 
such as purchase histories and cross-buying behavior can be obtained from bonus card 
programs or loyalty programs. However, non-transactional customer information such 
as interests, beliefs, values, opinions, future purchase intentions, and lifestyle 
characteristics is not discernible from transactional databases. Radford (2004) provides 
empirical evidence that non-transactional information is more relevant to the success 
of sophisticated marketing communication measures than transactional information. 
Moreover, Robertshaw and Marr (2005) conclude that non-transactional 
information does not map to demographic variables. They argue that the common practice of 
obtaining information for lifestyle segmentation introduces a bias toward particular 
consumer types. Interestingly, consumers are concerned about the use and the 
possible abuse of their transactional information in bricks-and-mortar business, e.g., price 
discrimination, as well as in e-commerce and m-commerce, e.g., identity theft or 
e-mail spamming. Nevertheless, they voluntarily disclose non-transactional 
information by generating and maintaining Web 2.0 contents. This non-transactional 
information comes in two qualities. The first consists of a person’s interests and 
opinions, which data mining techniques can identify by exploiting his or her 
contributions, such as blog entries (Glance, Hurst, Nigam, Siegler, Stockton, et al., 2005). 
The other quality is provided by the relationship of a person to other persons or groups 
of persons as well as links to topics. The latter serves to preselect individuals for 
canvassing customers or to identify opinion leaders for triggering a word-of-mouth 
communication process. Both data acquisition and data quality pose major challenges (Zheng 
& Padmanabhan, 2006). Seizing the challenge of identifying the relevant knots in a 
communication network, this paper aims to: 

• Outline the different aggregation levels in the analysis of the blogosphere arising 
from the considered layer of interactions 

• Propose the ego-network as a useful basic unit of investigation for studies aiming 
to improve marketing communication 

• Introduce quantitative assessments for the appraisal of knots in a communication 
network 

The remainder of this paper is structured as follows: Sect. 2 comprises an 
overview of the qualities of edges in the blogosphere and provides phenomenological 
insights into the aggregation levels of analysis. Quantitative assessments for the 
communication networks are also discussed in this section. An empirical example 
is used in Sect. 3 to demonstrate the structure analysis in the blogosphere as well 
as the identification of relevant ego networks. In the last section, we discuss the 
implications for marketing communication and draw final conclusions. 



2 Identifying Blogs on the WWW 
2.1 Social Networks of Blogs 

The multiple nature of blogs, unrestrained choice of topics and diffusion of the 
medium evoke substantive interest from scientists and business practitioners. Blogs 
are interconnected via a huge network (blogosphere). Selected parts of this network 
concerning topics, brands or products need to be analyzed to make this modern 
communication channel usable for direct marketing. Crawling the blog link structure 
is more difficult than crawling the web structure because not all the links of a blog 
convey a connection to another blog. Because bloggers (blog authors) use links 
very frequently, they develop a special linking system within the blogosphere, from 
which a unique social network has arisen (Leskovec et al., 2007). 

Fig. 1 Graph representations of blog networks (modified from Marlow (2004)) 



We aim to identify “important” blogs within the network by social network anal- 
ysis. The blogosphere is a directed graph where the actors are blogs and the relations 
are links. The edges of this graph are defined as four qualities (Marlow, 2004): 



Blogrolls: List of links to other blogs which the author reads regularly. 

Permalinks: Links which lead to a specific blog entry, but not to the main page 
of a blog. These links enable communication throughout the whole 
blogosphere. 

Comments: The simplest way to communicate via a blog. A reader just 
leaves a comment in response to a post or a further comment. 

Trackbacks: This unique blog function informs the author of a blog 
that his post was linked by another blogger. Thus, 
trackbacks enable bloggers to monitor their impact by means of the 
echo which their posts create in the blogosphere. 



These diverse functions of links result in redundant edges of the blog network. 
Therefore, multiple outgoing edges from one blog to another have a meaningful 
interpretation, as depicted in Fig. 1a. 

First, the links between two blogs are aggregated into weighted edges, with the 
weights equal to the number of combined links; the comment links are neglected. 
The result of this merging is the blog network depicted in Fig. 1b. Alternatively, the 
links between the posts on blogs are considered and the blogrolls are neglected. 
In this case, individual posts are the knots of the graph, as depicted in Fig. 1c. 
For our empirical application, we use the blog network depicted in Fig. 1b. The 
major challenge for marketers is the assessment of the relevance of the blogs for the 
communication process of the whole network. 
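
The merging of parallel links into weighted edges can be sketched in a few lines of 
Python. The link list, the blog names and the function name are hypothetical; the sketch 
only assumes that every raw link is recorded with its source blog, its target blog and its 
type, and that comment links (and, as an additional assumption, links of a blog to itself) 
are dropped. 

```python
from collections import Counter

# Hypothetical raw links: (source blog, target blog, link type).
RAW_LINKS = [
    ("blogA", "blogB", "blogroll"),
    ("blogA", "blogB", "permalink"),
    ("blogA", "blogB", "comment"),
    ("blogB", "blogA", "trackback"),
    ("blogB", "blogC", "permalink"),
]


def merge_links(links, ignored_types=("comment",)):
    """Collapse parallel links between two blogs into one weighted, directed edge."""
    weights = Counter()
    for source, target, link_type in links:
        # Comment links are neglected; self-links are skipped by assumption.
        if link_type in ignored_types or source == target:
            continue
        weights[(source, target)] += 1
    return dict(weights)


if __name__ == "__main__":
    for (source, target), weight in sorted(merge_links(RAW_LINKS).items()):
        print(f"{source} -> {target}: weight {weight}")
```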



2.2 Assessment of Egos and Ego Networks 

Different centrality measures have been proposed for the identification of 
“important” blogs (Everett & Borgatti, 1999). The degree centrality $C_D(\cdot)$ provides an 
impression of the structure of the network by considering the number of connections 
from one knot $i$ ($i = 1, \ldots, I$) to other knots $j$ ($j = 1, \ldots, i-1, i+1, \ldots, I$) of 
the network: 

$$C_D(k_i) = \sum_{j=1,\, j \neq i}^{I} x_{ij}, \qquad (1)$$ 

with $k_i$ denoting the knot $i$, $I$ denoting the total number of knots in the 
network, and $x_{ij}$ indicating a tie between the knots $i$ and $j$. In this study, the 
degree centrality is deemed to be the dimension of possible 
communication activity within the network. The more links a blog has, the higher is 
the probability of direct communication with other bloggers. Thus, the degree 
centrality is used to assess how suitable a blog is as a starting point for canvassing. 

The closeness centrality $C_C(\cdot)$ provides an impression of a blog’s centrality in 
relation to other blogs: 

$$C_C(k_i) = \frac{I - 1}{\sum_{j=1}^{I} d(k_i, k_j)}, \qquad (2)$$ 

with $d(k_i, k_j)$ denoting the number of edges between the knot pair $(i, j)$. In our 
application domain, the closeness centrality is deemed to be the dimension for 
independence from other blogs because the higher the closeness centrality of a blog is, the 
more direct connections are linked to it. Thus, a blog is less dependent on another 
blog if it has many others close by. Moreover, this measure is interpreted as the 
efficiency of a blog in reaching all the other knots within the network. Considering the 
distance from one blog to all other blogs in the graph, the closeness centrality indicates how 
fast a marketing communication measure could spread through the network, starting 
at blog $i$. 
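
As an illustration, degree and closeness centrality can be computed for a small toy network 
with a breadth-first search. The sketch rests on several assumptions that are not taken 
from the study: the edge list is invented, ties are treated as undirected, the network is 
connected, and closeness is taken in the common normalized form, i.e., the number of other 
knots divided by the sum of the geodesic distances to them. 

```python
from collections import deque

# Hypothetical toy network; ties are treated as undirected for simplicity.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]


def build_adjacency(edge_list):
    """Map every knot to the set of its neighbours."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def distances_from(adjacency, start):
    """Breadth-first search returning the geodesic distance to every reachable knot."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(adjacency, node):
    """Number of knots directly tied to `node`."""
    return len(adjacency[node])


def closeness_centrality(adjacency, node):
    """Number of other knots divided by the sum of distances (connected network assumed)."""
    dist = distances_from(adjacency, node)
    return (len(adjacency) - 1) / sum(dist.values())


if __name__ == "__main__":
    network = build_adjacency(EDGES)
    for knot in sorted(network):
        print(knot, degree_centrality(network, knot),
              round(closeness_centrality(network, knot), 3))
```

In this toy network the knot c has the most ties and the shortest distances to all other 
knots, so it would be the most promising starting point under both measures. 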

The betweenness centrality $C_B(\cdot)$ considers the shortest distances within the graph: 

$$C_B(k_i) = \frac{\sum_{j} \sum_{l} g_{jl}(k_i) / g_{jl}}{(I-1) \cdot (I-2)}, \quad j \neq l \text{ and } j, l \neq i, \qquad (3)$$ 

with $g_{jl}$ denoting the number of geodesics between the knots $j$ and $l$, and $g_{jl}(k_i)$ 
denoting the number of these geodesics that run through $k_i$. In this study, the betweenness 
centrality assesses the opportunities for controlling the communication process. If many 
shortest distances run over a blog, it has a high influence on the network communication, 
assuming that bloggers usually use the shortest way to communicate. In this way, 
communication from these blogs can be monitored and assessed by marketers with a view to 
influencing it as they wish. 
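
A brute-force sketch of betweenness centrality counts, for every ordered pair of other 
knots, the share of geodesics that pass through the knot under consideration. The toy 
network and all function names are hypothetical, ties are again treated as undirected, and 
the optional normalization by (I - 1)(I - 2) is an assumption about the scaling. The 
approach recomputes shortest paths for every pair and is therefore only suitable for small 
networks. 

```python
from collections import deque
from itertools import permutations

# Hypothetical toy network; ties are treated as undirected for simplicity.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]


def build_adjacency(edge_list):
    """Map every knot to the set of its neighbours."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def geodesics_from(adjacency, start):
    """BFS returning distances and the number of geodesics from `start`."""
    dist = {start: 0}
    count = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:            # first visit fixes the distance
                dist[neighbour] = dist[node] + 1
                count[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                count[neighbour] += count[node]  # extend every geodesic ending in node
    return dist, count


def betweenness_centrality(adjacency, node, normalized=False):
    """Sum over ordered pairs (j, l) of the share of geodesics running through `node`."""
    dist_i, count_i = geodesics_from(adjacency, node)
    others = [k for k in adjacency if k != node]
    total = 0.0
    for j, l in permutations(others, 2):
        dist_j, count_j = geodesics_from(adjacency, j)
        if l not in dist_j or node not in dist_j or l not in dist_i:
            continue                             # the pair is not connected via node
        if dist_j[node] + dist_i[l] == dist_j[l]:
            total += count_j[node] * count_i[l] / count_j[l]
    if normalized:
        n = len(adjacency)
        total /= (n - 1) * (n - 2)
    return total


if __name__ == "__main__":
    network = build_adjacency(EDGES)
    for knot in sorted(network):
        print(knot, betweenness_centrality(network, knot))
```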

Each blog is also assessed by its ego network, which comprises a single actor 
(ego), the actors that are connected to it (alters), and all the links among those alters 
(Everett & Borgatti, 2005). Ego networks mostly include a lot of strong and only a 
few weak ties. Strong ties are tight relations to others in a network, 
like long and close friendships. These strong tie networks tend to be more “socially 
close”. In other words, out-links are restricted to other blogs discussing a specific 
topic, which tend to create a connected group or community. In contrast, the weak 
ties are “loose contacts”, connecting various topic- and/or people-subgroups. This 
means they are connected to more than one network community, which makes them 
bridges between different networks. The blogosphere makes up a joint network 
of many communities. Here, the weak ties are more challenging for the structure 
analysis because these links knit together sub-networks of different 
communities or topic domains. Thus, the weak ties are relevant for spreading information 
throughout a network. Without weak ties, any information is likely to get stuck in 
sub-networks. Thus, the larger an ego network is, the more alters it has (alters that do 
not know, or barely know, one another), and the more different the alters are in 
relation to other criteria, the more powerfully this ego can distribute information. The 
mobilizing impact and the influence on other blogs increase with the cardinality 
of the ego network $n(k_i) = \sum_j x_{ji}$. The density of the ego network is the relation 
between the existing ties and the possible pairs, 

$$d(k_i) = \frac{\sum_i \sum_j x_{ij}}{n(k_i) \, (n(k_i) - 1)} \quad \forall \, i \neq j, \qquad (4)$$ 

where the sums run over the alters of the ego network. 
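
The construction of an ego network and its density can be sketched as follows. The link 
list and the function names are hypothetical, and the sketch assumes that the density 
relates the directed ties among the alters to the number of ordered pairs of alters. 

```python
# Hypothetical directed links between blogs: (source, target).
LINKS = [
    ("ego", "a"), ("ego", "b"), ("c", "ego"),
    ("a", "b"), ("b", "a"), ("b", "c"),
    ("c", "x"),  # x is not tied to the ego, so it is not an alter
]


def ego_network(link_list, ego):
    """Return the alters of `ego` and the directed ties among those alters."""
    alters = set()
    for source, target in link_list:
        if source == ego and target != ego:
            alters.add(target)
        elif target == ego and source != ego:
            alters.add(source)
    ties = {(s, t) for s, t in link_list
            if s in alters and t in alters and s != t}
    return alters, ties


def ego_density(link_list, ego):
    """Existing ties among the alters divided by the number of ordered alter pairs."""
    alters, ties = ego_network(link_list, ego)
    n = len(alters)
    if n < 2:
        return 0.0  # no pair of alters exists
    return len(ties) / (n * (n - 1))


if __name__ == "__main__":
    alters, ties = ego_network(LINKS, "ego")
    print("alters:", sorted(alters))
    print("ties among alters:", sorted(ties))
    print("density:", ego_density(LINKS, "ego"))
```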

Blogs which appear “important” in the examined network, because they are 
surrounded by a strong and dense ego network and show high centrality measures, 
matter for marketing purposes for two reasons. First, they act as multipliers 
(Katz & Lazarsfeld, 1955) of information within the network. Second, bloggers 
with low involvement often leave long gaps between updates of their blogs, whereas 
highly involved bloggers post very frequently, which guarantees reader traffic, 
topicality and, with this, influence. Marketers can make good use of these properties 
to spread their ideas, campaigns, products or names by contacting the “important” 
bloggers and leaving posts or placing conventional banner advertisements on their blogs. 



3 Empirical Application 

For this study, the top 40 blogs according to the ranking by blogrollin.com 
(07-03-2008) are considered. The web crawler SocSciBot, provided by Thelwall 
(2004), collected the blog URLs, their connections (links) and the blogs’ contents. 
The crawler started at all 40 blogs and crawled to a depth of 15,000 pages for each 
URL. Table 1 shows an extract of the blogs with their in- and out-links at the different 
aggregation levels. The page columns in the table provide the 
number of links found in the web pages of each crawl. The directory is a pair of 
crawled sites. The number of links counted directory-by-directory is the number of 
pairs of directories with a link from the first to the second. The site level only gives the 
number of direct links between the crawled sites. Therefore, for the analysis of the 
blog network, the domain IN and OUT columns, which give the number for each 
pair of crawled sites, were needed. The whole network’s graph is depicted in Fig. 2. 
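
One plausible reading of the domain-level counts is that page-level links are collapsed to 
the host names of their source and target pages and that every linked pair of domains is 
counted once. The following Python sketch illustrates this reading only; it does not 
reproduce SocSciBot's counting rules, and the URLs and the function name are hypothetical. 

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical page-level links (source URL, target URL) as a crawler might report them.
PAGE_LINKS = [
    ("http://blog-a.example/post/1", "http://blog-b.example/entry/7"),
    ("http://blog-a.example/post/2", "http://blog-b.example/entry/9"),
    ("http://blog-a.example/post/2", "http://blog-c.example/"),
    ("http://blog-b.example/entry/7", "http://blog-a.example/post/1"),
    ("http://blog-a.example/post/1", "http://blog-a.example/about"),  # internal link
]


def domain_link_counts(links):
    """Count distinct linked domains per domain, ignoring links within a domain."""
    out_targets = defaultdict(set)
    in_sources = defaultdict(set)
    for source_url, target_url in links:
        source = urlsplit(source_url).hostname
        target = urlsplit(target_url).hostname
        if source is None or target is None or source == target:
            continue
        out_targets[source].add(target)
        in_sources[target].add(source)
    domains = sorted(set(out_targets) | set(in_sources))
    return {d: {"IN": len(in_sources[d]), "OUT": len(out_targets[d])} for d in domains}


if __name__ == "__main__":
    for domain, counts in domain_link_counts(PAGE_LINKS).items():
        print(domain, counts)
```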

Obviously, the central and interacting blogs are closer together in the middle 
of the graph, and the less connected blogs fade out to the border. The position 
of the blogs in the network additionally takes into account the number of directed 
links, which, for reasons of legibility, are not shown in the graph. Visual 
inspection provides the impression that blogs such as metafilter.com, gizmodo.com and 
boingboing.net could have promising ego networks. The centrality measures and 
the IN and OUT links for a subset of the top 40 blogs are listed in Table 2. 

Table 1 Extract of the collected blogs concerning the IN and OUT links for pages, directories, 
domains and sites 

Name                     Page IN   Dir IN   Dom IN   Site IN   Page OUT   Dir OUT   Dom OUT   Site OUT 
andrewsullivan.com           272       45       13        11          0         0         0          0 
arstechnica.com            1,013      721       60        18          0         0         0          0 
boingboing.net            14,760    1,754       50        22        643       589        73         21 
dailykos.com               1,576    1,190       49        14          0         0         0          0 
fark.com                  34,858    9,510      151        16         85        25         3          3 
icanhascheezburger.com       176       74       22        13        308        48         7          5 
imao.us                   11,286    9,058        7         7         18        10         5          5 
instapundit.com           31,303    9,366       16        14         35         4         4          4 

Fig. 2 Social network of the top 40 blogs 
Table 2 Extract of the collected blogs with their IN and OUT links and some centrality measures 

Link name                IN links   OUT links      C_B      C_C      C_D 
atrios.blogspot.com            15           0        0    58.11       15 
beppegrillo.it                  1           1        0    43.88        1 
boingboing.net                 50          73   342.69    76.79      104 
dailykos.com                   49           0        0    58.90       49 
fark.com                      151           3    31.53    61.43      151 
hughhewitt.com                  6           0        0    50.00        6 
icanhascheezburger.com         22           7     7.83    62.32       28 

Max                           151         579   342.69    78.18      579 
Min                             0           2        0    40.57        1 
Mean                                             55.16    58.52    59.14 
Median                                           24.60    58.90       27 

The blog boingboing.net is highlighted because it has the highest betweenness of 
all considered blogs. Thus, this blog appears to be suited to controlling the 
communication process. Moreover, boingboing.net has a comparatively high closeness of 
C_C(boingboing.net) = 76.786. Because of its centrality and independence, this 
blog provides an efficient knot of origin for marketing communication measures in 
the blog network under consideration. Furthermore, this recommendation is 
supported by boingboing.net’s degree centrality of C_D(boingboing.net) = 104.000, 
which exceeds the arithmetic mean of the degree centrality. The ego network of 
boingboing.net connects with 30 of the other 39 observed blogs, and there are 206 
ties in this ego network. The density of the network is d(boingboing.net) = 23.68%, 
which is above average compared to the other ego networks. This means 
that a strong ego network has been identified. The strong properties of 
boingboing.net could now be put to good use for marketing purposes to spread ideas, campaigns, 
products or names by contacting the “important” blogger and leaving posts or placing 
conventional banner advertisements on this blog. 
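
The reported figures are consistent with a density defined as the number of ties divided 
by the number of ordered pairs of alters. Under this assumption, the 30 alters and 206 ties 
of the ego network give 

```latex
\[
d(\text{boingboing.net}) = \frac{206}{30 \cdot (30 - 1)} = \frac{206}{870} \approx 0.2368 ,
\]
```

which suggests that the reported value of 23.68 is to be read as a percentage. 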



4 Conclusions and Future Work 

This study aims to identify blogs suited for triggering and controlling a marketing 
communication process on the Web 2.0. Therefore, we analyze the structure of the 
blog-communication network and assess each blog’s position within this network. 
It turned out that counting the IN-links and OUT-links at the domain level fits the 
needs of direct marketers. However, our results are restricted to the task of identifying 
blogs for triggering discussions and should not be generalized to media selection 
in digital advertising planning because the number of visitors and readers is not 
considered in this study. 
Different measures for quantifying the individual blog’s position in the 
communication network have been discussed with respect to their usefulness and 
interpretation from the direct marketer’s point of view. The empirical part of this paper 
demonstrates an application in which the visual impression of the blog’s position is 
confirmed by the quantitative assessments. Promising avenues of further research 
are a dynamic consideration that includes the changes of the communication 
network and a differentiation of older from recent interactions. Moreover, sensitivity 
analysis might provide a quality evaluation of the quantitative assessments discussed 
in this paper. 
Analyzing Preference Rankings 
when There Are Too Many Alternatives 

Kar Yin Lam, Alex Koning, and Philip Hans Franses 



Abstract Consumer preferences can be measured by rankings of alternatives. 
When there are too many alternatives, this consumer task becomes complex. One 
option is to have consumers rank only a subset of the available alternatives. This has 
an impact on subsequent statistical analysis, as now a large number of ties is 
observed. We propose a simple methodology to perform proper statistical 
analysis in this case. It also allows testing whether (parts of the) rankings are random or 
not. An illustration shows its ease of application. 

Keywords Multiple comparisons • Rankings • Ties. 



1 Introduction and Motivation 

Stated consumer preferences can be measured by rankings of alternatives. Rankings 
are easy to collect, they are easy to interpret, and various statistical tools to evaluate 
the randomness of observed rankings have been developed. Interestingly, such tools 
have not yet been fully developed and analyzed for the case where consumers 
evaluate only a subset of the alternatives, and in the particularly interesting case where 
not all consumers evaluate the same subset of alternatives. 

This situation, which we address in the current paper, has become increasingly 
important these days in marketing and consumer research, as it has been widely 
recognized that individuals face difficulties, or even become dissatisfied, when hav- 
ing to evaluate and compare too many choice options. Iyengar and Lepper (2000) in 
their famous experiment showed that too much choice can be de-motivating for con- 
sumers. In a marketing context, Boatwright and Nunes (2001) demonstrated that a 
reduction of the assortment is in fact felt as beneficial to consumers, a finding which 
has been supported by Chernev (2003) and Gourville and Soman (2005), among
others. In another context, Deshazo and Fermo (2002) showed that consumers
experience task complexity when choice options are plentiful; see also Sandor and
Franses (2009). In sum, consumers may find it difficult to rank preferences across
too large a number of alternatives.

In the case of having many alternatives, a simple solution would now be to ask 
consumers only to evaluate a subset of these alternatives. A consequence of this lib- 
erty is that subsequent statistical analysis becomes more complicated as the potential 
number of ties becomes (much) larger. 

There are many alternative inferential approaches which could be discussed, 
see for example the overview given in Marden (1995). However, these alternative 
approaches are usually technically advanced, which may explain their slow adoption.
In this paper we advocate a simple and easily understandable methodology.

The outline of our paper is as follows. In Sect. 2 we give a few preliminaries 
to sketch the problem at hand. In Sect. 3 we outline the statistical methodology. 
In Sect. 4 we illustrate its relevance and ease of use on the observed rankings of ten 
blockbuster movies where respondents were asked only to rank four of these. Sect. 5 
concludes with a few avenues for further research. 



2 Preliminaries 

Preference rankings are a common tool in consumer surveys. Various studies indicate
that the task of comparing all k objects or products could be too difficult. Even
if a respondent completes the task and ranks all products, the discriminating
power may be doubtful, as the ability of respondents to rank alternatives effectively
and reliably is a function of the number of comparisons to be made.

The problem of task complexity can be alleviated by asking respondents to evalu- 
ate all k alternatives but only to give preference rankings for a subset s, with s < k, 
that is selected by each respondent. Hence, each respondent can have a different 
subset s. 

A second issue of our interest is to statistically test observed preference rankings. 
The corresponding null hypothesis of interest is then 

$H_0$: There are no differences across the alternatives. Each arrangement of the k ranks
is equally likely,

while the alternative is

$H_1$: At least one alternative tends to yield a higher ranking than at least one other
alternative.

The null hypothesis implies that the underlying distributions for each alternative for
each respondent are the same. The null hypothesis is rejected at the significance
level $\alpha$ if a test statistic T exceeds the $(1 - \alpha)$th quantile of a chi-square random
variable with $k - 1$ degrees of freedom. In the most basic model, where we have one
observation per respondent and each respondent ranks all alternatives, the Friedman
test statistic (Friedman 1937) is appropriate and most commonly used.
However, in our case, where we ask respondents to evaluate k alternatives but only
give preference rankings for a subset s that is selected by each respondent, matters
are different. We assume that respondents are indifferent to alternatives outside their
own subset and these alternatives thus receive equal rank. As a consequence, we
must deal with a substantial number of ties.

We aim to propose a statistical method that can handle this situation and we pro- 
pose an appropriate test statistic to examine if observed rankings imply statistically 
significant differences across the alternatives. If there is statistical evidence of such 
differences, the final question is of course which alternatives it concerns. To this 
end, we apply multiple comparison procedures to test which alternatives are signifi- 
cantly different from others and in this way we can construct homogeneous subsets 
with alternatives that have equal rankings. 



3 Methodology 

Suppose there are k alternatives, with $j = 1, \ldots, k$, and that there are n respondents,
with $i = 1, \ldots, n$, who are asked to indicate their top s alternatives and to
assign ranks to the alternatives in this subset, where the most preferred alternative
gets rank value 1 and the least preferred gets rank s. We assume that respondents are
indifferent between the $k - s$ alternatives outside this subset, which are thus observed
as ties. Denote the observed ranking of respondent i for alternative j by $x_{ij}$. Let
$R(x_{i1}, \ldots, x_{ik})$ be the set of all possible rankings consistent with $x_{i1}, \ldots, x_{ik}$ given
the ties for respondent i. The weighted rank $r_{ij}$ assigned to $x_{ij}$ is defined as the
average of all possible ranks within the set $R(x_{i1}, \ldots, x_{ik})$. In Wittkowski (1988)
it is shown that the weighted rank for tied ranks is equal to the average of the ranks
“available” for those tied ranks. Adjusted (centered-weighted) ranks $a_{ij}$ are obtained
by subtracting the expected score $(k + 1)/2$ under $H_0$ from the weighted ranks.

In Table 1 we illustrate the computation of weighted and adjusted ranks for a 
respondent who has indicated her top s = 3 alternatives from a set of k = 6. 
This respondent has indicated alternative C as most preferred, alternative A is next 
preferred and alternative D is least preferred. We assume that this respondent is 
indifferent to alternatives B, E and F. Ranks within subset {A, C, D} are fixed.



Table 1 Computation of weighted and adjusted ranks for a hypothetical dataset with s = 3, k = 6
and one respondent

Alternative  x_ij  R(x_i1, ..., x_i6)  Weighted rank  Adjusted rank
A            2     2  2  2  2  2  2    12/6 = 2       -1.5
B            -     4  4  5  5  6  6    30/6 = 5        1.5
C            1     1  1  1  1  1  1     6/6 = 1       -2.5
D            3     3  3  3  3  3  3    18/6 = 3       -0.5
E            -     5  6  4  6  4  5    30/6 = 5        1.5
F            -     6  5  6  4  5  4    30/6 = 5        1.5
However, ranks assigned to alternatives outside this subset could be 4, 5 or 6.
The set of all possible rankings is listed in Table 1. It then follows that the weighted
rank for an alternative equals the average of the “available” ranks for that alternative.
Available ranks for alternatives outside the subset are 4, 5 and 6 and thus their tied
rank equals 5. The adjusted ranks are obtained by subtracting the expected score
under $H_0$, which in this case is 3.5, from the weighted ranks, see the last column of
Table 1.
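
The computation of weighted and adjusted ranks for this example can be sketched in a few lines of Python. This is only an illustration under our own naming (the function `weighted_and_adjusted_ranks` is hypothetical, not part of the paper): it enumerates every ranking that is consistent with the stated top-s ranks, averages the ranks per alternative, and subtracts the expected score (k + 1)/2.

```python
from fractions import Fraction
from itertools import permutations


def weighted_and_adjusted_ranks(alternatives, stated):
    """Average each alternative's rank over all rankings consistent with `stated`.

    alternatives: list of k labels
    stated: dict mapping the s ranked labels to their ranks 1..s
    """
    k = len(alternatives)
    s = len(stated)
    tied = [a for a in alternatives if a not in stated]
    free_ranks = range(s + 1, k + 1)  # ranks "available" to the tied alternatives

    # Enumerate the set R: stated ranks stay fixed, tied alternatives take
    # every permutation of the remaining ranks.
    consistent = []
    for perm in permutations(free_ranks):
        ranking = dict(stated)
        ranking.update(zip(tied, perm))
        consistent.append(ranking)

    weighted = {a: Fraction(sum(r[a] for r in consistent), len(consistent))
                for a in alternatives}
    expected = Fraction(k + 1, 2)  # expected rank under the null hypothesis
    adjusted = {a: weighted[a] - expected for a in alternatives}
    return weighted, adjusted


# Respondent of Table 1: C is ranked first, A second, D third; B, E, F are tied.
weighted, adjusted = weighted_and_adjusted_ranks(
    list("ABCDEF"), {"C": 1, "A": 2, "D": 3})
for a in "ABCDEF":
    print(a, float(weighted[a]), float(adjusted[a]))
```

The enumeration is feasible here because only (k - s)! = 6 rankings exist; for larger problems one would use the average of the available ranks directly, as stated in the text.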

The test statistic for $H_0$ is based on sums of adjusted ranks and the individual
covariance matrix $V_i$, which is given in Wittkowski (1988) as

$$V_i = \sigma_{0,i}^2 \left( \mathrm{diag}(\iota) - \frac{1}{k}\,\iota\iota' \right), \qquad (1)$$

where $\iota$ is a k-dimensional vector of ones, $\mathrm{diag}(\iota)$ denotes a diagonal matrix with
elements 1 on the diagonal and where $\sigma_{0,i}^2$ denotes the (conditional) individual
variance under $H_0$ with correction for ties, which is

$$\sigma_{0,i}^2 = \frac{k(k+1)}{12} \left( 1 - \frac{k-s-1}{k-1} \cdot \frac{k-s}{k} \cdot \frac{k-s+1}{k+1} \right). \qquad (2)$$

This individual variance $\sigma_{0,i}^2$ is the same for all i, as the sum of the (squared)
adjusted ranks for each respondent is the same, which of course is a crucial aspect
in our methodology. Since each respondent only has to assign ranks to s alternatives,
and alternatives outside this subset all receive the same tied rank, this yields
the same sum of ranks for each respondent. In our example $\sigma_{0,i}^2 = 3.1$ for each
respondent, see Table 2.

The individual variance $\sigma_{0,i}^2$ is instrumental in computing the individual covariance
matrix $V_i$. Note that (1) implies that the diagonal elements of $V_i$ are given by
$\sigma_{0,i}^2 (k - 1)/k$ and the off-diagonal elements by $-\sigma_{0,i}^2/k$. Hence, the diagonal
elements of (1) are $1 - k$ times the off-diagonal elements. One can observe
that $V_i$ only depends on k and $\sigma_{0,i}^2$, and both are the same for each respondent. As a
consequence, $V_i$ is also the same for each respondent. The computation of $\sigma_{0,i}^2$ and
$V_i$ in our example is given in Table 2.
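
As a hedged numerical check of (1) and (2), the following sketch evaluates the tie-corrected variance and the covariance matrix with exact fractions. The function names are our own, and the cross-check against the sum of squared adjusted ranks divided by k - 1 is an observation we add for illustration rather than a formula quoted from the paper.

```python
from fractions import Fraction


def tie_corrected_variance(k, s):
    """Individual variance of Eq. (2): one tie group of size k - s."""
    t = k - s
    tie_term = Fraction((t - 1) * t * (t + 1), (k - 1) * k * (k + 1))
    return Fraction(k * (k + 1), 12) * (1 - tie_term)


def individual_covariance(k, s):
    """Covariance matrix of Eq. (1): sigma^2 * (I - (1/k) * ones * ones')."""
    sigma2 = tie_corrected_variance(k, s)
    return [[sigma2 * ((1 if r == c else 0) - Fraction(1, k)) for c in range(k)]
            for r in range(k)]


sigma2 = tie_corrected_variance(6, 3)
V = individual_covariance(6, 3)

# Cross-check with the adjusted ranks of Table 1: sum of squares over k - 1.
adjusted = [Fraction(x) for x in ("-5/2", "-3/2", "-1/2", "3/2", "3/2", "3/2")]
check = sum(a * a for a in adjusted) / (6 - 1)

print(sigma2, check)     # the two values agree
print(V[0][0], V[0][1])  # diagonal and off-diagonal element of V_i
```

With k = 10 and s = 4 the same function returns 65/9, the value used in the illustration of Sect. 4.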



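Table 2 Example continued. Computation of conditional individual covariance matrix

$$\sigma_{0,i}^2 = \frac{6(6+1)}{12} \times \left( 1 - \frac{2}{5} \times \frac{3}{6} \times \frac{4}{7} \right) = \frac{31}{10}$$

$$V_i = \frac{31}{10} \left( \mathrm{diag}(\iota) - \frac{1}{6}\,\iota\iota' \right)$$

that is, $V_i$ is the 6 x 6 matrix with 31/12 on the diagonal and -31/60 off the diagonal.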






The adjusted ranks can be summarized in a vector $a_+$ by summing all $a_i$ over the
n respondents, that is, the jth element of vector $a_+$ is given by $a_{+j} = \sum_{i=1}^{n} a_{ij}$.

The random vectors $a_i$ are independent under $H_0$ and the covariance matrix $V_+$ of
the vector $a_+$ is thus equal to the sum of the individual covariance matrices $V_i$. As
$V_i$ is the same for all i, $V_+$ is obtained by multiplying $V_i$ by the number of
respondents n.

As $a_{ik}$ for each i can be expressed in terms of $a_{i1}, \ldots, a_{i(k-1)}$, it follows that $V_+$
is not of full rank but has rank $k - 1$ instead.

For large n, $a_+$ approximately has a multivariate normal distribution with
zero expectation vector and covariance matrix $V_+$ under the null hypothesis $H_0$
(Wittkowski 1988).



3.1 Test Statistic 

The quadratic-form test statistic is now computed along standard lines as

$$W = a_+' (V_+)^{-} a_+, \qquad (3)$$

where $(V_+)^{-}$ denotes a generalized inverse of $V_+$, that is, any matrix which satisfies
$V_+ (V_+)^{-} V_+ = V_+$. Below we will use the Moore-Penrose generalized inverse
of $V_+$.

Under $H_0$ and for large n, W in (3) has approximately a chi-square distribution
with $k - 1$ degrees of freedom. Recall that $k - 1$ is the rank of $V_+$. If the corresponding
p-value is below the significance level $\alpha$, the null hypothesis $H_0$ is rejected
and there is sufficient evidence to conclude that there is a difference in preference
rankings between the k alternatives.



3.2 Multiple Comparisons

When $H_0$ is rejected, we conclude that there exist differences between the alternatives.
However, we do not know which alternatives differ in terms of preference
rankings and so we resort to a multiple comparison procedure to make decisions
about differences between all $k(k - 1)/2$ pairs of alternatives. Note that multiple
comparisons are only of interest if the global null hypothesis $H_0$ is rejected:
when $H_0$ is not rejected, it is generally agreed that all hypotheses implied by that
hypothesis (its “components”) must also be considered as not rejected.

Denote the average adjusted rank over the respondents for alternative j by

$$\bar{a}_{.j} = \frac{1}{n} \sum_{i=1}^{n} a_{ij} = \frac{a_{+j}}{n}.$$

The null hypothesis $H_{0,jj^*}$ for each pairwise comparison of no difference between
alternative j and alternative $j^*$ is rejected if

$$|\bar{a}_{.j} - \bar{a}_{.j^*}| > r_\alpha, \qquad (4)$$

where the critical value $r_\alpha$ is chosen to make the type I error rate equal to $\alpha$. That
is, $r_\alpha$ is the smallest constant such that

$$P_{H_0}\left( \max_j \bar{a}_{.j} - \min_j \bar{a}_{.j} > r_\alpha \right) \le \alpha. \qquad (5)$$

This implies that when $H_0$ is true all $k(k - 1)/2$ differences in (4) fail to exceed the
critical constant with probability at least $1 - \alpha$.

When $H_0$ is true, one may show that the covariance matrix of the differences
$\bar{a}_{.j} - \bar{a}_{.j^*}$ is the same as the covariance matrix of the differences $Z_j - Z_{j^*}$, where
$Z_j$, $Z_{j^*}$ are independent random variables with mean zero and variance $\sigma_{0,i}^2/n$,
with $j, j^* = 1, \ldots, k$. Thus, the asymptotic distribution of

$$\frac{\max_{j,j^*} |\bar{a}_{.j} - \bar{a}_{.j^*}|}{\sigma_{0,i}/\sqrt{n}}$$

coincides with the distribution of the range $Q_{k,\infty}$ of k independent standard normal
random variables. It then follows that when n is large, the critical value $r_\alpha$ can be
approximated by

$$r_\alpha \approx q_{k,\infty}^{\alpha}\, \frac{\sigma_{0,i}}{\sqrt{n}}, \qquad (6)$$

where $q_{k,\infty}^{\alpha}$ is the upper $\alpha$ percentile point of the range of k independent standard
normal random variables. The critical points can be found in Harter (1960).



3.3 Rank Plots 

To visualize the results of the multiple comparison procedures, we can construct for
each j an interval $Q_j$ centered at $\bar{a}_{.j}$ with length equal to the critical value $r_\alpha$ and
endpoints

$$\bar{a}_{.j} \pm \tfrac{1}{2} r_\alpha.$$

When we observe that intervals $Q_j$ and $Q_{j^*}$ do not overlap, the distance between
$\bar{a}_{.j}$ and $\bar{a}_{.j^*}$ exceeds $r_\alpha$ and hence the null hypothesis in (4) should be rejected,
yielding the conclusion that there is a significant difference in rank between alternative
j and alternative $j^*$. The rank plot simultaneously displays the intervals $Q_1, \ldots, Q_k$.
Moreover, a reference line can be drawn at the height of the upper boundary of the
interval of the “most preferred” alternative, which naturally has the lowest upper
boundary. This reference line in fact visualizes the “unconstrained multiple comparison
procedure with the best, deducted from all-pairwise comparisons” as described in
Hsu (1996, Sect. 4.2.1.1). This implies that all alternatives with intervals above this
reference line are rated significantly lower than the best ranked alternative.



3.4 Homogeneous Subsets 

Finally, based on the multiple comparisons results, we can form homogeneous 
groups of alternatives by performing a cluster analysis. The clustering algorithm 
we prefer is the complete linkage clustering (see for example Lattin, Carroll, and 
Green 2003), where the maximum distance between elements of each cluster is 
used. We construct a distance matrix which summarizes the significance tests (4)
by zeros (non-rejection) and ones (rejection). It is well known that such a distance
matrix of zeros and ones could lead to multiple solutions in complete linkage
clustering, so when alternatives j and $j^*$ are not significantly different we add the
difference $\bar{a}_{.j} - \bar{a}_{.j^*}$ to the distance matrix. However, to prevent this from masking
the multiple comparison results, we multiply the difference $\bar{a}_{.j} - \bar{a}_{.j^*}$ by
$\epsilon$, with $\epsilon$ small enough, like 0.001, before adding it to the distance matrix. Then,
the cluster analysis is based on the multiple comparison procedure.



4 Illustration 

In this section we illustrate our statistical methodology to analyze preference rank- 
ings where individuals are asked to rank just s of the total k alternatives. 



4.1 Data 

We illustrate our proposed methodology with data of n = 93 individuals who are
asked to evaluate a list of k = 10 blockbuster movies in Dutch cinema theatres in
2007. The movies are listed in Table 3.

Respondents are asked to indicate their top s = 4 movies. Movies outside this
subset are observed as ties and their weighted rank is the average of the available
ranks, that is 7.5. The adjusted ranks of individual i for each movie are obtained by
subtracting the expected score 5.5 from the weighted ranks. As $r_{ij}$ takes all values
from 1 to 4, and 7.5 for tied ranks, for each respondent, the sum of the (squared)
adjusted ranks is the same for each respondent and, as a consequence, the individual
variance and individual covariance matrix are the same for all i. These are respectively
$\sigma_{0,i}^2 = 65/9$, and $V_i$ is given by a matrix with $\sigma_{0,i}^2 \cdot 9/10$ on the diagonal and
$-\sigma_{0,i}^2/10$ on the off-diagonal elements. To compute the test statistic (3) we sum the
adjusted ranks over all 93 individuals and $a_{+j}$ for each movie is given in the second column
Table 3 List of k = 10 blockbuster movies in Dutch cinema theatres in 2007, ranked in the first
column according to the total size of audience

Name movie                                       Sum adjusted ranks   Average weighted ranks
 1. Pirates of the Caribbean: At World’s End          -79.5                 4.645
 2. Harry Potter and the Order of the Phoenix           1.0                 5.511
 3. Alles is Liefde                                    40.5                 5.936
 4. Shrek the Third                                    -6.0                 5.435
 5. Mr. Bean’s Holiday                                150.0                 7.113
 6. Ratatouille                                         1.5                 5.516
 7. Ocean’s Thirteen                                  -66.5                 4.785
 8. Spider-Man 3                                       76.5                 6.323
 9. Transformers                                        5.5                 5.559
10. The Bourne Ultimatum                             -123.0                 4.177



of Table 3. In the same way, we compute the conditional covariance matrix $V_+$ by
summing all individual covariance matrices $V_i$. Recall that this $V_+$ has rank 9.



4.2 Results 

Our null hypothesis $H_0$ of no differences between the movies is clearly rejected, as
the test statistic (3) takes the value W = 83.276 with a corresponding
p-value of approximately 0. We conclude that there is sufficient evidence that there
exists a difference between the movies.
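
The value of W can be reproduced from the sums of adjusted ranks in Table 3. The sketch below is an illustration under one simplifying observation of ours: since the covariance matrix equals n times the individual variance times the projection I - J/k (J being the all-ones matrix), its Moore-Penrose inverse is that same projection divided by n times the variance, so no numerical matrix inversion is needed.

```python
# Sums of adjusted ranks a_{+j} from Table 3 (n = 93 respondents, k = 10, s = 4).
a_plus = [-79.5, 1.0, 40.5, -6.0, 150.0, 1.5, -66.5, 76.5, 5.5, -123.0]
n, k = 93, 10
sigma2 = 65 / 9  # individual variance reported in Sect. 4.1

# Moore-Penrose inverse of V_+ = n * sigma2 * (I - J/k). The matrix I - J/k is a
# symmetric projection, so its pseudo-inverse is the matrix itself.
pinv = [[((1.0 if r == c else 0.0) - 1.0 / k) / (n * sigma2) for c in range(k)]
        for r in range(k)]

# Quadratic form W = a_+' (V_+)^- a_+ of Eq. (3).
W = sum(a_plus[r] * pinv[r][c] * a_plus[c] for r in range(k) for c in range(k))
print(round(W, 3))
```

For comparison, the 95% quantile of a chi-square distribution with 9 degrees of freedom is about 16.92 in standard tables, so the computed statistic lies far in the rejection region.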

As $H_0$ is rejected, the question remains which movies differ in terms of preferences
and so we perform multiple comparisons between all pairs of movies. For
$\alpha$ = 0.05 and k = 10, $q_{k,\infty}^{\alpha}$ takes the value 4.474. The critical value can now be
approximated by (6) and it takes the value 1.247.

To visualize the results of the multiple comparison procedure, we construct rank
plots. For convenience the $\bar{a}_{.j}$ values are retranslated into the average weighted
ranks $\bar{r}_{.j}$, that is, $\bar{r}_{.j} = \frac{1}{n} \sum_{i=1}^{n} r_{ij}$. As we have already calculated the sum of the
adjusted ranks for each movie, we can also compute this average rank $\bar{r}_{.j}$ by adding the
expected score to the average adjusted score of movie j, that is

$$\bar{r}_{.j} = \bar{a}_{.j} + \frac{k+1}{2}.$$

The average weighted ranks are given in the last column of Table 3. The interval $Q_j$
for movie j is then centered at $\bar{r}_{.j}$ and has length equal to the critical value $r_\alpha$.
Hence, the endpoints of $Q_j$ are given by

$$\bar{r}_{.j} \pm \tfrac{1}{2} r_\alpha.$$

If the intervals $Q_j$ and $Q_{j^*}$ do not overlap, then the distance between $\bar{r}_{.j}$ and $\bar{r}_{.j^*}$
exceeds $r_\alpha$ and hence the corresponding null hypothesis in (4) should be rejected.
The corresponding rank plot is displayed in Fig. 1.

Fig. 1 Rank plot with $\alpha$ = 0.05 of k = 10 blockbuster movies in Dutch cinema theatres in 2007

We can observe in Fig. 1 that the best ranked movie is “The Bourne Ultimatum”.
Its interval has the lowest upper boundary and consequently all movies whose
intervals lie above this upper boundary are ranked significantly lower. These are all
movies except “Pirates of the Caribbean: At World’s End” and “Ocean’s Thirteen”,
as the intervals of these two movies overlap with the interval of “The Bourne
Ultimatum” and hence they are not ranked significantly lower.
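
The intervals behind the rank plot can be recomputed from Table 3. The following sketch is illustrative only: it takes the tabulated range point 4.474 quoted in the text as given, applies approximation (6), and lists the movies whose interval reaches down to the reference line of the best ranked movie. Variable names are our own.

```python
from math import sqrt

movies = ["Pirates of the Caribbean: At World's End",
          "Harry Potter and the Order of the Phoenix", "Alles is Liefde",
          "Shrek the Third", "Mr. Bean's Holiday", "Ratatouille",
          "Ocean's Thirteen", "Spider-Man 3", "Transformers",
          "The Bourne Ultimatum"]
a_plus = [-79.5, 1.0, 40.5, -6.0, 150.0, 1.5, -66.5, 76.5, 5.5, -123.0]
n, k = 93, 10
sigma = sqrt(65 / 9)  # square root of the individual variance of Sect. 4.1
q = 4.474             # upper 5% point of the range, as quoted in the text

r_alpha = q * sigma / sqrt(n)                     # approximation (6)
avg_rank = [a / n + (k + 1) / 2 for a in a_plus]  # average weighted ranks
intervals = [(r - r_alpha / 2, r + r_alpha / 2) for r in avg_rank]

# Reference line: upper boundary of the interval of the best ranked movie.
best = min(range(k), key=lambda j: avg_rank[j])
reference = intervals[best][1]
not_worse = [movies[j] for j in range(k) if intervals[j][0] <= reference]

print(round(r_alpha, 3))
print(not_worse)
```

A movie appears in `not_worse` exactly when its interval overlaps the interval of the best ranked movie, which is the visual criterion described above.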

Based on the multiple comparisons we form homogeneous subsets of movies by
cluster analysis. The distance matrix summarizes the significance tests (4) by zeros
(non-rejection) and ones (rejection). When movies j and $j^*$ are not significantly
different, the difference $\bar{a}_{.j} - \bar{a}_{.j^*}$ multiplied by $\epsilon$ is added in the corresponding
distance matrix; $\epsilon$ should be chosen small enough so that the multiple comparison
results are not disrupted. Here we set $\epsilon$ equal to 0.001. The corresponding
dendrogram from this cluster analysis suggests three main clusters. Cluster 1 contains
the best ranked movies: “The Bourne Ultimatum”, “Pirates of the Caribbean: At
World’s End” and “Ocean’s Thirteen”. Cluster 2 contains the worst ranked movie,
“Mr. Bean’s Holiday”, and the last cluster contains all other movies. In sum, there
seem to be just three clusters of movies with the same within-cluster rank.
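
The clustering step can be sketched as follows. This is a minimal complete linkage implementation written for illustration, not the software used for the paper. Two choices are assumptions of ours: the absolute difference of average adjusted ranks is used for non-rejected pairs, and merging stops as soon as the smallest complete linkage distance reaches 1, that is, when every further merge would join significantly different movies. Short movie labels are used for brevity.

```python
from math import sqrt

names = ["Pirates", "Harry Potter", "Alles is Liefde", "Shrek", "Mr. Bean",
         "Ratatouille", "Ocean's", "Spider-Man", "Transformers", "Bourne"]
a_plus = [-79.5, 1.0, 40.5, -6.0, 150.0, 1.5, -66.5, 76.5, 5.5, -123.0]
n, k = 93, 10
r_alpha = 4.474 * sqrt(65 / 9) / sqrt(n)  # critical value from approximation (6)
eps = 0.001
avg = [a / n for a in a_plus]             # average adjusted ranks


def distance(j, l):
    """1 for a rejected pair, eps * |difference| for a non-rejected pair."""
    diff = abs(avg[j] - avg[l])
    return 1.0 if diff > r_alpha else eps * diff


def complete_linkage(c1, c2):
    """Largest pairwise distance between two clusters."""
    return max(distance(j, l) for j in c1 for l in c2)


clusters = [[j] for j in range(k)]
while len(clusters) > 1:
    pairs = [(complete_linkage(clusters[x], clusters[y]), x, y)
             for x in range(len(clusters)) for y in range(x + 1, len(clusters))]
    d, x, y = min(pairs)
    if d >= 1.0:  # assumed stopping rule: only significant merges remain
        break
    clusters[x] = clusters[x] + clusters[y]
    del clusters[y]

for c in clusters:
    print(sorted(names[j] for j in c))
```

Under these assumptions the procedure stops with three groups, in line with the dendrogram reading reported above.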
5 Conclusion

Preference rankings are easy to perform and the outcomes are easy to understand.
However, in practice we often encounter the problem that consumers find it difficult
to rank preferences across a large number of alternatives. Our solution is to
ask consumers to rank only a subset of these alternatives and we have shown that
it does not matter whether these individual subsets are all different. We have provided
a methodology for handling ties in this situation and for analyzing such
data. We have given the test statistic to examine if observed rankings imply statistically
significant differences between alternatives. Further, if there are differences
across alternatives, we have explained how to apply multiple comparisons to determine
which alternatives differ. Moreover, based on these multiple comparisons
we propose a method to perform cluster analysis to construct homogeneous groups
of alternatives. We have illustrated our methodology with data on ten blockbuster
movies in Dutch cinema theatres and we found that there are basically just three
groups of movies with common ranks.

We envisage a range of practical applications of our methodology, for example in 
the area of conjoint analysis. Consumers can now face many alternatives, but when 
they are asked to rank just a few alternatives, the task will become less demanding. 
Considerations on the Impact
of Ill-Conditioned Configurations 
in the CML Approach 



Antonio Punzo 



Abstract Considering complete designs, the configurations of non-existence of the 
Maximum Likelihood (ML) estimates for the Partial Credit Model are known in the 
Joint (JML) approach: null categories and ill-conditioned patterns are the only two 
sources of trouble. In the Conditional (CML) approach, apart from datasets with 
null categories, the other “anomalous” configurations are not known. In this paper, 
the impact of ill-conditioned patterns in the conditional approach, as well as the 
incidence of CML-anomalous configurations, are both studied by a systematic anal- 
ysis on small-dimensional data matrices. Obtained results emphasize the presence 
of a large number of additional CML configurations of non-existence, compared to 
those valid in the JML case. 

Keywords Conditional maximum likelihood • Partial credit model. 



1 Introduction 

In psychometrics and educational testing the analysis of the relation between latent 
continuous variables and observed categorical variables - which can be dichoto- 
mous or (nominal/ordinal) polytomous - is known as Item Response Theory (IRT). 
In applications it is very common to have data that are polytomous, above all with 
three or four categories (e.g., in aptitude testing, the response is often classified in 
one of the following categories: “wrong”, “partially correct”, “fully correct”). The 
purpose of using more than two categories per item is to try to obtain more informa- 
tion about the latent trait generically referred to as “ability”, of the people being 
measured. 

The discussion will be concentrated on polytomous IRT models for items with 
ordered categories. In this context, the unidimensional (parametric) polytomous 
Rasch models constitute an appropriate reference frame given the good theoretical
and practical properties which they enjoy. From a theoretical point of view, they are 
the only models to conform to the fundamental measurement theory adhering to the 
principle of specific objectivity (Fischer, 1995; Rasch, 1977). From a practical point 
of view, it is easy to observe that the distribution belongs to the exponential family 
and this makes these models simpler in terms of handling requests from the inferen- 
tial procedures based on the Maximum Likelihood (ML). Both of these theoretical 
and practical advantages are closely linked to the existence of non-trivial sufficient 
statistics for the parameters and they would contribute to an a priori exclusion of 
alternative IRT models. 

Attention will be focused on the Partial Credit Model (PCM; Masters, 1982) 
which can be considered - conceptually and structurally - as the most general 
among the polytomous Rasch models. Successful applications of the PCM to a wide 
variety of measurement problems in psychology, education, medicine, marketing 
and other fields where testing is relevant, have been reported in literature (for a list 
of these applications see, e.g., Masters & Wright, 1997). Moreover, the simplicity of
the model formulation makes it easy to implement, and a range of specific software 
is devoted to it. The PCM is presented in Sect. 2. 

The model contains only two different kinds of parameters: fixed-effect parameters
for items, and person parameters. All of these parameters are locations on an
underlying variable and their estimate is the only realization of the ideas embodied
in the measurement model. This feature distinguishes the PCM from IRT models 
including item “discrimination” or “dispersion” parameters, which qualify locations 
thus confounding the interpretation of the latent variable. As regards the person 
parameters, they can be seen either as fixed-effects or as random-effects, with con- 
sequences for the estimation methods and inferences one can make; usually, the first 
point of view is preferred. In this frame, the commonly used approaches are: Joint 
ML (JML) and Conditional ML (CML). 

The joint approach estimates person and item parameters simultaneously max- 
imizing the joint likelihood. A major disadvantage associated with JML is that 
the estimators of item parameters are not consistent when the number of subjects 
approaches infinity (Neyman & Scott, 1948). The reason is that the number of 
parameters increases at the same rate as the sample size increases because each 
new person implies a new parameter. 

As an alternative, Rasch (1960) suggested estimating the item parameters by 
the CML method, where the conditioning is with respect to the sufficient statistics
for the individual parameters. Under suitable conditions on the variability of $\theta$
in the population, these CML estimates are consistent (see Andersen, 1995). This
approach, for the PCM, is described in Sect. 3. 

The discussion is focused on the existence of the estimates for the PCM under 
the assumption of a complete data matrix without extreme row patterns. Section 4 
is devoted to the presentation of the state of the art on this issue both for JML 
and for CML approaches. In Sect. 5 - starting from an example of configuration 
of non-existence in both analyzed estimation approaches - a systematic analysis 
concerning fixed small-dimensional datasets is developed; the aim is to make useful
considerations about the problem of existence for the CML estimates. 



2 The Partial Credit Model 



The PCM is applicable in the following context. Consider the responses of an
n-dimensional set $\mathcal{S} = \{S_1, \ldots, S_v, \ldots, S_n\}$ of subjects to a k-dimensional
sequence $\mathcal{I} = \{I_1, \ldots, I_i, \ldots, I_k\}$ of items. Each subject may respond to item $I_i$ in
$m + 1$ ($m \ge 1$) ordered categories, $C_0, C_1, \ldots, C_h, \ldots, C_m$. The score is chosen to
be h in correspondence to $C_h$, $h = 0, 1, \ldots, m$; with this requirement, the item is
conceptualized as a series of ordered steps and the respondent receives a unitary credit
for each successfully completed step (Wright & Masters, 1982).

Before proceeding with the model presentation, it is convenient to write the actual
response for individual $S_v$ to $I_i$ as the selection vector $y_{vi} = (y_{vi0}, y_{vi1}, \ldots, y_{vim})$,
where $y_{vih}$ is an observation from the random variable $Y_{vih}$ and $y_{vih} = 1$ if the
response is in category $C_h$, and 0 otherwise. From here on it will be assumed that, for
each item, the subject chooses one and only one of the $m + 1$ categories; consequently,
incomplete designs will be excluded from the analysis. Moreover, let $x_{vi}$
be the observed score of $S_v$ with reference to $I_i$. Naturally, $x_{vi} \in \{0, 1, \ldots, m\}$. Let
$x = (x_{vi})$ be the score matrix.

Usually, the PCM is introduced through the specification of the probability $p_{vih}$
that a subject $S_v$, with parameter $\theta_v$, will respond to item $I_i$ in category $C_h$. There
are several equivalent parameterizations of $p_{vih}$ (to preserve the theoretical properties
of the Rasch family, all possible model parameterizations must be defined on
successive dichotomization of adjacent categories; see, e.g., Bertoli-Barsotti, 2008),
but the most “economic” one, in view of the algebraic elaborations, is

$$p_{vih} = P(Y_{vih} = 1 \mid \theta_v, \beta_i) = \frac{\exp(h\theta_v - \beta_{ih})}{\sum_{t=0}^{m} \exp(t\theta_v - \beta_{it})}, \qquad h = 0, 1, \ldots, m, \qquad (1)$$

where $\beta_i = (\beta_{i1}, \ldots, \beta_{ih}, \ldots, \beta_{im})$ is the item parameter vector related to $I_i$
and, for notational convenience, $\beta_{i0} = 0$. If m = 1, the simple Rasch Model
(RM; Rasch, 1960) is obtained. It is straightforward to show that the model defines
an exponential family.

It is to be noted that if $\theta_v^* = \theta_v + c$ and $\beta_{ih}^* = \beta_{ih} + hc$ for any real constant c,
then $h\theta_v^* - \beta_{ih}^* = h\theta_v - \beta_{ih}$. To allow the identifiability of the model, the parameters
must be normalized, for instance, with the following constraint:

$$\sum_{i=1}^{k} \sum_{h=0}^{m} \beta_{ih} = 0. \qquad (2)$$

Thus, there are $n + km - 1$ unconstrained parameters to be estimated.
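
To make the model and its identifiability issue concrete, the sketch below evaluates the category probabilities of (1) for one subject and one item and checks that shifting the person parameter by c and each item parameter of category h by hc leaves the probabilities unchanged. The function name and the numerical values are hypothetical and chosen only for illustration.

```python
from math import exp


def pcm_probabilities(theta, beta):
    """Category probabilities of Eq. (1) for one subject and one item.

    theta: person parameter
    beta:  [beta_i1, ..., beta_im]; beta_i0 = 0 is prepended internally
    """
    b = [0.0] + list(beta)
    weights = [exp(h * theta - b[h]) for h in range(len(b))]
    total = sum(weights)
    return [w / total for w in weights]


theta, beta = 0.3, [-0.5, 0.8]  # illustrative values, not taken from the paper
p = pcm_probabilities(theta, beta)

# Shift theta by c and beta_ih by h * c (h = 1, ..., m): same probabilities.
c = 1.7
shifted = pcm_probabilities(theta + c,
                            [b + (h + 1) * c for h, b in enumerate(beta)])

print(p)
print(shifted)
```

The invariance shown here is the reason a normalization such as constraint (2) is needed before the parameters can be estimated.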






3 CML Approach to Estimate Item Parameters 

Given the nk independent observations $y_{vi}$, $v = 1, \ldots, n$, $i = 1, \ldots, k$, and adopting
the usual “dot” notation (e.g., $y_{v.h}$ stands for $\sum_{i=1}^{k} y_{vih}$, and so on), the item
parameters $\beta = (\beta_1, \ldots, \beta_i, \ldots, \beta_k)$ can be estimated, given the person’s raw
scores

$$r_v = \sum_{i=1}^{k} \sum_{h=0}^{m} h\, y_{vih} = \sum_{h=0}^{m} h\, y_{v.h}, \qquad v = 1, \ldots, n,$$

maximizing the conditional log-likelihood function

$$l_C(\beta) = -\sum_{i=1}^{k} \sum_{h=0}^{m} y_{.ih}\, \beta_{ih} - \sum_{r=1}^{mk-1} n_r \log \gamma_r(\beta), \qquad (3)$$

where $n_r$ is the number of persons with a particular score r and

$$\gamma_r(\beta) = \sum_{y \mid r} \exp\left( -\sum_{i=1}^{k} \sum_{h=0}^{m} \beta_{ih}\, y_{vih} \right), \qquad r = 0, 1, \ldots, mk,$$

are known as elementary symmetric functions; $\sum_{y \mid r}$ is the sum over all response
vectors that produce the score r. The sum $\sum_{r=1}^{mk-1}$ implicitly excludes extreme
person’s raw scores of 0 or mk because these subjects do not affect the conditioned
procedure. By conditioning the likelihood onto $r = (r_1, \ldots, r_v, \ldots, r_n)$, the person
parameters $\theta = (\theta_1, \ldots, \theta_v, \ldots, \theta_n)$, which in this context are nuisance parameters,
vanish from $l_C(\beta)$. It is straightforward to note both that the item-category
totals $y_{.ih}$ are minimally sufficient for $\beta_{ih}$, and that this conditional function belongs
to the exponential family with minimal representation (consequently, $l_C(\beta)$ is
strictly concave and this issue is important in the maximization phase).

To maximize $l_C(\beta)$, all derivatives with respect to $\beta_{ih}$ must be equated to 0. Some
algebra leads to the following set of CML equations:

$$y_{.ih} = \exp(-\beta_{ih}) \sum_{r=1}^{mk-1} n_r\, \frac{\gamma_{r-h}^{(i)}(\beta)}{\gamma_r(\beta)}, \qquad i = 1, \ldots, k, \quad h = 0, 1, \ldots, m, \qquad (4)$$

where $\gamma_{r-h}^{(i)}(\beta)$ denotes the elementary symmetric function evaluated by omitting
item $I_i$. To solve (4) most computer algorithms use a Newton-Raphson procedure,
which is fast in the sense that it usually requires few iterations, and in the sense that
the quantities involved are ratios of elementary symmetric functions, easily obtained
from convenient recurrence relations (see, e.g., Andersen, 1995).
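
One such recurrence can be sketched as follows. This is an illustration under our own naming, not code from any of the cited programs: items are added one at a time, and each step convolves the current elementary symmetric functions with the terms exp(-beta_ih) of the new item. A brute-force enumeration over all response patterns is included as a check, and the item parameters are hypothetical.

```python
from itertools import product
from math import exp, isclose


def elementary_symmetric(beta):
    """gamma_0, ..., gamma_mk for item parameters beta[i] = [beta_i1, ..., beta_im]."""
    gamma = [1.0]  # with no items, only the score 0 is possible
    for item in beta:
        eps = [1.0] + [exp(-b) for b in item]  # eps_i0 = exp(-beta_i0) = 1
        new = [0.0] * (len(gamma) + len(eps) - 1)
        for r, g in enumerate(gamma):
            for h, e in enumerate(eps):
                new[r + h] += g * e  # new item answered in category h
        gamma = new
    return gamma


def elementary_symmetric_brute(beta):
    """Same quantities by summing over every response pattern."""
    k, m = len(beta), len(beta[0])
    gamma = [0.0] * (m * k + 1)
    for pattern in product(range(m + 1), repeat=k):
        term = 1.0
        for item, h in zip(beta, pattern):
            term *= 1.0 if h == 0 else exp(-item[h - 1])
        gamma[sum(pattern)] += term
    return gamma


beta = [[-0.4, 0.3], [0.2, 0.9], [-1.0, 0.5]]  # hypothetical values, k = 3, m = 2
fast = elementary_symmetric(beta)
slow = elementary_symmetric_brute(beta)
print(fast)
```

The functions with one item omitted, as needed in (4), follow from the same routine applied to the parameter list without that item, for example `elementary_symmetric(beta[:1] + beta[2:])` for the second item.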






4 State of the Art Regarding Existence of ML Estimates 

A finite solution of (4) does not always exist. Undoubtedly, a source of trouble is
the presence of null categories: in the implementation of partial credit analysis, all
parameters can be estimated for an item only if observations occur in each of the
available response categories (see Masters & Wright, 1997). The presence or
absence of null categories is routinely checked by some estimation programs
(see, e.g., Mair & Hatzinger, 2007). In this regard, Wilson and Masters (1993) have
developed a procedure for automatically reparameterizing the PCM to provide JML
estimates of a smaller number of item parameters when one or more response
categories for an item are null. Again, a person with an extreme total score is also
“particular”: in the JML case Wright and Masters (1982) recommended removing
such persons; this suggestion holds also in the CML case since these persons do not
affect the conditioning. For these reasons, from now on, only score matrices with
neither extreme row patterns nor null categories will be considered.

For the simple RM (m = 1), Fischer (1981) elaborates, both for complete and
incomplete designs, the necessary and sufficient (n.s.) conditions for the existence
and uniqueness of a solution for the JML and CML estimation equations, easily
verifiable from x total scores. Thanks to this latter result, many estimation programs
can perform a preliminary check on x to verify if the (joint or conditional)
likelihood function admits a unique critical point.

Unfortunately, more complicated conditions are necessary for the existence of
the ML estimates for the PCM. Bertoli-Barsotti (2005) gives, for complete
designs only, an n.s. condition for the existence and uniqueness of the JML
estimates, in a form that is simple to verify from the structure of x. This
result is based on a direct extension of the concept of ill-conditioned matrix
introduced by Fischer (1981) for the dichotomous case. Roughly speaking, the
author defines a dataset as ill-conditioned if there exists at least one
partition of S into two non-empty subsets, $S_1$ and $S_2$, such that if a
subject belongs to $S_2$, his score on $I_i$ is not better than the score on
$I_i$, $i = 1, \dots, k$, of any subject in $S_1$; then, the subjects in $S_1$
appear to be “infinitely more able” than those from $S_2$, and no comparison of
subjects from the two classes is possible. If no such subsets exist, the data
are said to be well-conditioned. The latter property is necessary for the
existence of a finite and unique JML solution for the PCM; it is also
sufficient, provided that there are no items with null categories.
Bertoli-Barsotti (2005) also presents a useful method for establishing
ill-conditioning.
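
The rough definition above can be checked mechanically on a small matrix. The following is a minimal sketch, not the method of Bertoli-Barsotti (2005): it simply tries every split of the persons into two groups, which is only feasible for a handful of rows. The function names are hypothetical, and the matrix is the 8 x 4 score matrix of Table 1a.

```python
from itertools import combinations

# Score matrix of Table 1a (rows = persons S1..S8, columns = items I1..I4).
X = [
    [2, 1, 2, 2],
    [2, 2, 1, 1],
    [2, 2, 1, 0],
    [2, 1, 0, 1],
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]

def has_null_category(x, n_categories=3):
    """True if some item has a response category that nobody used."""
    return any(len(set(col)) < n_categories for col in zip(*x))

def has_extreme_row(x, max_category=2):
    """True if some person scored the minimum or the maximum on every item."""
    top = max_category * len(x[0])
    return any(sum(row) in (0, top) for row in x)

def ill_conditioned_partition(x):
    """Return (S1, S2) as lists of row indices if x is ill-conditioned, else None.

    Brute force over all splits: S2 is a dominated group when, on every item,
    no member of S2 scores higher than any member of S1.
    """
    n = len(x)
    items = range(len(x[0]))
    for size in range(1, n):
        for s2 in combinations(range(n), size):
            s1 = [v for v in range(n) if v not in s2]
            if all(max(x[v][i] for v in s2) <= min(x[v][i] for v in s1)
                   for i in items):
                return s1, list(s2)
    return None

print(has_null_category(X), has_extreme_row(X))
print(ill_conditioned_partition(X))
```

For this matrix the search returns the split that puts the last three persons in the dominated group, while the two preliminary checks (null categories, extreme rows) both come back negative.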

Again for the PCM, but in the CML approach, the conditions are in principle
known for the exponential family, either in the original Barndorff-Nielsen form
(Barndorff-Nielsen, 1978) or in the form given by Jacobsen (1989). However, it
is not yet known whether these conditions can be brought into a form such that
the existence of solutions of the ML equations can be verified from the
structure of x (see, e.g., Andersen, 1995). An attempt in this direction is
provided by Bertoli-Barsotti (2002), who, based on known notions of convex
analysis, presents an n.s. condition for the existence and uniqueness of the ML
estimates, in the case of a concave log-likelihood function, that is easier to
verify than the existing ones.

Table 1 Ill-conditioned score matrix and CML estimates

(a) Score matrix

        I_1   I_2   I_3   I_4   Tot.
S_1      2     1     2     2     7
S_2      2     2     1     1     6
S_3      2     2     1     0     5
S_4      2     1     0     1     4
S_5      2     1     0     0     3
S_6      2     0     0     0     2
S_7      1     1     0     0     2
S_8      0     1     0     0     1
Tot.    13     9     4     4

(b) CML estimates

I       $\beta_{i1}$   $\beta_{i2}$
I_1       -8.259        -41.928
I_2      -33.668        -16.680
I_3       16.294         33.974
I_4       16.294         33.974

(c) Accurate CML estimates

Working precision   I       $\beta_{i1}$   $\beta_{i2}$
40                  I_1      -10.871        -52.307
                    I_2      -41.435        -20.373
                    I_3       20.369         42.124
                    I_4       20.369         42.124
50                  I_1      -13.205        -63.973
                    I_2      -50.768        -25.039
                    I_3       25.035         51.457
                    I_4       25.035         51.457
60                  I_1      -16.205        -78.973
                    I_2      -62.768        -31.039
                    I_3       31.035         63.457
                    I_4       31.035         63.457



5 Analysis of Fixed Small-Dimensional Datasets 

An ill-conditioned complete score matrix x with n = 8 persons and k = 4 items,
each with three categories, is shown in Table 1a. A finite JML solution for the
parameters of this matrix does not exist; a CML solution, by contrast, could
exist (as far as is known in the literature), since null categories are not
present. CML estimates, obtained with the R package eRm (Mair & Hatzinger, 2007)
under the constraint (2), are summarized in Table 1b. The eRm package solves the
CML equations (4) by the Newton-Raphson procedure. These estimates do not appear
to be “meaningful” if one considers that the results are expressed in logits.
This concern is confirmed by an exhaustive numerical analysis of $l_c(\beta)$
carried out in the Mathematica environment. Indeed, Table 1c shows that as the
numerical accuracy of the maximization algorithm increases, the absolute values
of the parameter estimates increase as well.

To realize that a finite maximum point does not exist, a simple analytic study
of $l_c(\beta)$ is sufficient. The eRm package does not reveal this problem
because the Newton-Raphson procedure fails to converge and a false critical
point is detected; consequently, a user could erroneously trust these “false
estimates”. From a geometrical point of view, the function to be maximized
presents a kind of “plateau”: in this “flat” zone, small variations in the value
of the function correspond to large variations in the position of the point.
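
The plateau effect can be reproduced with a much simpler likelihood than the conditional PCM one. The sketch below is a toy illustration under stated assumptions, not the eRm or Mathematica computation: it uses a one-parameter logistic log-likelihood with only positive responses, which is strictly increasing and therefore has no finite maximum. The function name and the stopping rule (gradient below a tolerance) are illustrative choices.

```python
import math

def newton_raphson(n_obs, tol, beta=0.0, max_iter=200):
    """Maximize l(beta) = -n_obs * log(1 + exp(-beta)) by Newton-Raphson.

    This toy log-likelihood (n_obs positive responses, no negative ones) is
    strictly increasing, so it has no finite maximum. The loop nevertheless
    stops as soon as the gradient falls below `tol`.
    """
    for _ in range(max_iter):
        q = math.exp(-beta) / (1.0 + math.exp(-beta))  # 1 - sigmoid(beta)
        gradient = n_obs * q
        if abs(gradient) < tol:
            break
        hessian = -n_obs * q * (1.0 - q)
        beta -= gradient / hessian
    return beta

# Tightening the tolerance pushes the reported "estimate" further out.
estimates = [newton_raphson(n_obs=5, tol=t) for t in (1e-4, 1e-8, 1e-12)]
print(estimates)
```

Each run stops at a finite value that looks like an estimate, but the value keeps growing as the tolerance is tightened, which is the same pattern that Table 1c shows for increasing working precision.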

Summarizing, the parameters related to the dataset in Table 1a turn out to be
inestimable with both the JML and the CML method. For the sake of uniformity,
from now on, a dataset corresponding to a likelihood function without a critical
point will be called anomalous. This expression embraces a set of factors; for
example, in JML estimation for the PCM, score matrices with null categories and
ill-conditioned configurations are the two factors of anomaly (Bertoli-Barsotti,
2005). To distinguish the configurations leading to non-existence of the
parameter estimates under the JML and the CML approach, the expressions
JML-anomalous and CML-anomalous will be adopted, respectively.

In the light of this issue, and borrowing the idea of Linacre (2004), an
in-depth analysis of fixed small-dimensional datasets with three categories has
been performed. Matrices of this kind have no practical interest, but they are
interesting from a theoretical point of view. The aim is to evaluate the impact
of ill-conditioned configurations in the CML approach, as well as the incidence
of CML-anomalous datasets. Null categories will not be considered here because,
although they are associated with a problem of non-existence of the estimates in
both the JML and CML methods, they are usually treated as a problem of
non-identifiability of the model and interpreted as indicating that a new
definition of the categories is required.

To introduce the analysis, it is convenient to consider the case of 4 x 3 score
matrices with categories scored as 0, 1 or 2. To evaluate the “really different”
(from an estimation point of view) matrices of this kind, an automated procedure
has been implemented in the R environment. Initially, by means of a 3-nomial
tree, it generates all possible $3^{12} = 531441$ matrices of this kind. Let A
be the set of all matrices with at least one null category for at least one item
in $\mathcal{I} = \{I_1, I_2, I_3\}$; there are 484785 such matrices. Let B be
the set of all matrices with at least one extreme person’s raw score of 0 or 6;
there are 140816 such matrices. In the first step the procedure removes from the
analysis the set $C = A \cup B$, of cardinality 497037 (a data matrix can have
both null categories and extreme raw scores, i.e., $A \cap B \neq \emptyset$).
In principle, one could consider the matrices belonging to B but, from a
practical point of view, their “real” row dimension would equal the number of
rows without extreme total scores, whereas the analysis is restricted to 4 x 3
score matrices. The remaining 34404 matrices are sorted in non-increasing order
with respect to both row and column totals (note that the order of both subjects
and items is only conventional). After this step, only 1333 matrices turn out to
be different from one another (“different” meaning in at least one entry of the
matrix). These matrices are not yet “really” different from an estimation point
of view, because some of them still coincide up to row/column permutations. To
see the issue, it is useful to consider the two score matrices in Table 2: a
simple permutation of the second and fourth rows of the matrix in Table 2a
yields the matrix in Table 2b. Once such cases are accounted for, the “really”
different matrices are only 273; each of them can be considered as
representative of an equivalence class in which equality is meant entry by entry
up to row/column permutations. These 273 score matrices are shown in Table 3.
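
The counting step of this procedure is easy to re-run. The following is a minimal Python sketch (the authors used R, and their code is not shown here), with hypothetical function names. It enumerates all 4 x 3 matrices, counts the sets A and B, their union and the matrices that remain, and adds one possible way of picking a representative of each equivalence class under row and column permutations.

```python
from itertools import permutations, product

CATEGORIES = (0, 1, 2)
N_PERSONS, N_ITEMS = 4, 3

def has_null_category(matrix):
    """Set A: some item (column) never shows one of the categories 0, 1, 2."""
    return any(set(col) != set(CATEGORIES) for col in zip(*matrix))

def has_extreme_row(matrix):
    """Set B: some person has raw score 0 or the maximum (here 6)."""
    top = N_ITEMS * max(CATEGORIES)
    return any(sum(row) in (0, top) for row in matrix)

def count_matrices():
    """Enumerate all 3**12 matrices and count A, B, their union and the rest."""
    rows = list(product(CATEGORIES, repeat=N_ITEMS))
    counts = {"total": 0, "A": 0, "B": 0, "A_or_B": 0, "kept": 0}
    for matrix in product(rows, repeat=N_PERSONS):
        in_a, in_b = has_null_category(matrix), has_extreme_row(matrix)
        counts["total"] += 1
        counts["A"] += in_a
        counts["B"] += in_b
        counts["A_or_B"] += in_a or in_b
        counts["kept"] += not (in_a or in_b)
    return counts

def canonical_form(matrix):
    """Smallest representative of a matrix under row and column permutations."""
    best = None
    for col_order in permutations(range(len(matrix[0]))):
        # Sorting the rows covers every row permutation for this column order.
        candidate = tuple(sorted(tuple(row[j] for j in col_order)
                                 for row in matrix))
        if best is None or candidate < best:
            best = candidate
    return best

counts = count_matrices()
print(counts)
```

The enumeration reproduces the cardinalities quoted in the text (531441, 484785, 140816, 497037 and 34404). Two matrices that differ only by a row or column permutation share the same `canonical_form`, so collecting the distinct canonical forms of the retained matrices is one way to build the equivalence classes; this sketch does not attempt to reproduce the intermediate figure of 1333, which depends on the authors' sorting step.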

After a joint analysis with Mathematica and eRm, the 68 matrices shown in
roman bold prove to be CML-anomalous. Therefore, the presence of anomalous
configurations that are not due to null categories is confirmed. Among these 68
matrices, it is possible to identify seven matrices (shown in italics) that are
also ill-conditioned; these are the only JML-anomalous matrices in the analysis.
Consequently, there exist 61 well-conditioned matrices (the overwhelming
majority) that prove to be CML-anomalous: the ill-conditioned configurations are
thus a small subset of the CML-anomalous ones. In other words, at least one
further factor of anomaly must exist, apart from the null categories and
ill-conditioned patterns that affect the JML approach.

Table 2 Score matrices different up to permutations: (a) Score matrix, (b) Score matrix

Table 3 All 273 “substantially different” 4 x 3 score matrices with three categories. The horizontal [...]

A similar analysis of about 1000 small-dimensional matrices of sizes other than
4 x 3 gave equivalent results in relative terms.



6 Concluding Remarks 

Considering complete score matrices without both extreme row patterns and null 
categories, the existence of CML estimates for the PCM item parameters has been 
investigated here. Several examples of anomalous datasets have been given. From 
a systematic analysis of small-dimensional score matrices, with three categories, 
a CML-anomalous configuration has been detected whenever an ill-conditioned 
one has been considered. Consequently, the ill-conditioning should be seen as a 
sort of sufficient condition for the CML-anomaly. Moreover, from the same analy- 
sis, the existence of several CML-anomalous configurations that are different from 
ill-conditioned ones, stands out; this denotes a stronger incidence of anomalous 
configurations in the CML approach than in the JML one. 

From a practical point of view, when the analysis is performed on more realistic
datasets, heuristic considerations lead to the conclusion that the incidence of
particular configurations, like the ill-conditioned ones, should be minor. The
possible presence of missing values, which has not been analyzed in this paper
but will be considered soon, could heighten this incidence.

The still open question is the structure that characterizes these
CML-anomalous configurations from a theoretical point of view. To this end, it
may be useful to consider systematically matrices of small dimension and with
few categories, in order to bring out the need for specific constraints. In this
way, all packages or programs that perform CML estimation by a maximization
algorithm such as Newton-Raphson (algorithms that are exposed to the risk of
stopping at false critical points) could perform an initial data check to detect
these anomalous configurations.


Dyadic Interactions in Service Encounter
Bayesian SEM Approach 



Adam Sagan and Magdalena Kowalska-Musiał



Abstract Dyadic interactions are an important aspect of service encounters. They
may be observed in B2B distribution channels, professional services, buying
centers, family decision making or WOM communications. The networks consist
of dyadic bonds that form dense but weak ties among the actors.

The aim of the paper is to identify latent properties of dyadic interactions in
the mobile phone service market. Latent variable models in relational marketing
often focus either on the effects of relations or treat the relationship
dimensions as psychological constructs at the individual-trait level.

We propose an approach based on Bayesian latent variable modeling of social
networks with dyads as analytic units. This approach makes it possible to model
emergent and relational properties of actors’ interactions in dyads that are
irreducible to individual latent traits or psychological constructs.

Several competing models are developed and compared using Bayesian structural
equation models of dyadic data. Bayesian SEM helps to overcome the limitations
of the more traditional solutions based on ML or WLS estimation. It is robust
for small samples, which are common in social network analysis, and it can also
be applied to non-normal data as well as to non-linear relations between latent
variables.

Keywords Bayesian SEM • Dyadic data • Relationship marketing. 



1 Introduction 

1.1 Service Encounter in Relationship Marketing 

Understanding the structure of relationships is fundamental in marketing and
especially in relationship marketing. The relationship is a paradigm in the
academic discipline of marketing (Grönroos, 1993). The relationship is defined
as comprising ongoing long-term interactions, which are a dynamic process, in
contrast to transactional discrete exchanges (Dwyer, Schurr, & Oh, 1987).

Academic researchers are interested in relations in three sectors of marketing:
business, services and consumer marketing. The first sector is the
business-to-business relationship, which has a long tradition of investigating
many relational constructs, for example the relationship between members of
distribution channels, trust between trading partners, or cross-cultural
differences within trading dyads (Ford, Håkansson, and Johanson, 1986; Anderson,
Håkansson, and Johanson, 1994). These relationships in channels are viewed as
dyadic. In intra-organisational communications or buying centers, network
contexts are established to analyze this structure because of the large numbers
of interdependent business relationships.

The second sector, newer than the previous one, also recognizes the importance
of relationships. This is service marketing, applicable especially to
professional services, which are likely to be made up of a long-term series of
interactions. The purchase of services is thought to be a process which
partially depends on interpersonal interactions between the service provider and
the customer (Iacobucci & Ostrom, 1996; Ostrom & Iacobucci, 1995). Professional
service encounters are an example of dyadic interactions.

Building relationships is also important in the business-to-consumer arena
(Winkelman, Schulz, Edelman, & Silverstein, 1993), as they encourage ties with
individual consumers and focus on serving consumers’ needs over time. In this
market, family decision making is an example of dyadic interactions; the
opposite is word-of-mouth communication, which takes place in networks that
involve larger groups of consumers (Iacobucci and Hopkins, 1992).

The service encounter is a human interaction between a buyer and a seller in
the service setting. It is based upon an interactive process, an interactive
interface between a service provider and a service receiver (Grönroos, 2001).
Czepiel (1990) insisted that research into service encounters should take into
account the perspectives of both parties involved in this human interaction. The
dominant metaphor used in the relationship marketing literature presents the
service encounter experience as a theater (Harris, Harris, & Baron, 2003). This
metaphor provides a context for the study of service encounters: role theory.
Both interacting parties, the client and the seller, assume a role. In these
roles, verbal and nonverbal communication frame the empathy and satisfaction of
service encounters and determine the service encounter experience (Broderick,
1998).



1.2 Research Design 

Social network methods have been used to understand the structure of
relationships better (Knoke & Kuklinski, 1982; Scott, 2007). A characteristic
feature of any network approach is the recognition of interdependence among the
researched entities.

Table 1 Constructs and their indicators

Construct      Indicator                     Label
Empathy        Repeating words               SE1 (CE1)
Empathy        Finishing sentences           SE2 (CE2)
Empathy        Leaning forward               SE3 (CE3)*
Satisfaction   Eye contact                   SS1 (CS1)
Satisfaction   Smiling                       SS2 (CS2)
Satisfaction   Nodding                       SS3 (CS3)
Involvement    More talking than listening   SI (CI)

* Item excluded from the analysis



Predominant SEM models of relationship marketing in the contemporary literature
are based on the assumption that relationship characteristics (loyalty, empathy,
trust, etc.) are psychological constructs inferred from an individual unit of
analysis. These models usually omit interdependencies between actors and the
dyadic nature of relationships. In the proposed model dyads are the focal units
of analysis, so interdependence between the actors is recognized within the
context of the buyer-seller relationship. The dyad consists of a service
provider and a customer and is the basic unit of analysis. We want to understand
the process (service encounter) by looking at both parties of a transaction as a
dyad, not as individuals. Of course, dyadic interactions may be regarded as a
special case of networks, in which pairs of actors are independent of each
other.

The analysis of the service encounter was based on observational data. An
observation questionnaire was developed; each field worker observed an
interaction between a buyer and a service provider and marked the non-verbal
types of behavior. Observations of the 55 service encounters (dyads) were
conducted at selected retail outlets of the mobile phone operators Orange (28
dyads), Era (15 dyads) and Plus GSM (13 dyads) in Cracow and small towns of the
Małopolska region. Data were collected over three days. The aim of the analysis
was to measure the empathy-satisfaction relationship and its influence on the
length of a service relation. We proposed behavioral indicators of empathy and
satisfaction with the relation, which were used in McKechnie, Grant, and Bagaria
(2007). The behavioral indicators of the constructs are shown in Table 1.



2 APIM Model: Bayesian SEM Approach 
2.1 Assumptions of Bayesian SEM 

Traditional structural equation modeling relies on the ML (GLS) approach to
parameter estimation. However, Bayesian latent variable modeling is rapidly
becoming more popular in marketing (Rossi, Allenby, & McCulloch, 2005). The
most attractive feature of Bayesian estimation of SEM models is its flexibility
in using prior information to obtain better parameter estimates.

The estimation begins with the specification of the distribution of the data y
given the unobservable parameters $\theta$, $p(y \mid \theta)$. The prior
beliefs about $\theta$, $p(\theta)$, come from many sources, including theories,
expert knowledge, analysis of similar data, etc. The role of sensible prior
information is more significant when the samples are small (however, wrong prior
assumptions may negatively affect the final result).

The dependence of the posterior on the prior (which can easily be assessed by 
trying different priors) provides an indication of how much information on the 
unknown parameter values is contained in the data. If the posterior is highly depen- 
dent on the prior, then the data contains little information, while if the posterior is 
largely unaffected under different priors, the data is likely to be highly informative. 
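
This sensitivity check can be illustrated with a conjugate model in which the posterior is available in closed form. The sketch below is an assumption-laden toy example (a binomial proportion with Beta priors, not the SEM of this paper); the three priors and the function names are hypothetical.

```python
def beta_posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean of a binomial proportion under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

def prior_sensitivity(successes, trials, priors):
    """Spread of the posterior mean across a list of (a, b) Beta priors."""
    means = [beta_posterior_mean(successes, trials, a, b) for a, b in priors]
    return max(means) - min(means)

# Hypothetical priors: flat, sceptical (mean 0.2) and optimistic (mean 0.8).
PRIORS = [(1, 1), (2, 8), (8, 2)]

small = prior_sensitivity(successes=6, trials=10, priors=PRIORS)
large = prior_sensitivity(successes=600, trials=1000, priors=PRIORS)
print(round(small, 3), round(large, 3))  # 0.3 0.006
```

With ten observations the posterior mean moves by 0.3 across the three priors; with a thousand observations and the same observed proportion it moves by less than 0.01, so the data dominate.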

Additionally, Bayesian estimation depends less on asymptotic theory and
therefore produces reliable results even with small samples. As the sample size
increases, or when the investigator uses non-informative priors, the Bayesian
solution becomes similar to the ML estimate.

In contrast to the maximum likelihood method, in Bayesian estimation parameters
are considered random, with a prior distribution and a prior density function
(Lee, 2007). Once the data are collected, they are combined with the prior
distribution using Bayes' theorem; the posterior distribution
$p(\theta \mid y)$ is then calculated, reflecting both the prior knowledge and
the empirical data. The joint posterior distribution is summarized using Markov
Chain Monte Carlo (MCMC) simulation techniques in terms of lower-dimensional
summary statistics such as the posterior mean
$E(\theta \mid y) = \int \theta\, p(\theta \mid y)\, d\theta$ and posterior
standard deviations.
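
How MCMC draws turn into these summary statistics can be shown on a one-parameter problem. The following is a minimal random-walk Metropolis sketch under simplifying assumptions (a binomial proportion with a flat prior, hypothetical data of 7 successes and 3 failures); it is not the sampler used for the SEM in this paper, and the step size and chain length are arbitrary choices.

```python
import math
import random

def log_posterior(theta, successes, failures):
    """Unnormalized log posterior: binomial likelihood times a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + failures * math.log(1.0 - theta)

def metropolis(successes, failures, n_draws=20000, burn_in=2000, step=0.2, seed=1):
    """Random-walk Metropolis draws from the posterior of theta."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, failures)
    draws = []
    for i in range(n_draws + burn_in):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, failures)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            draws.append(theta)
    return draws

draws = metropolis(successes=7, failures=3)
mean = sum(draws) / len(draws)
sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / (len(draws) - 1))
print(round(mean, 2), round(sd, 2))
```

For this toy model the exact posterior is Beta(8, 4), with mean 2/3 and standard deviation of about 0.13, so the averages of the draws can be compared directly with the integral they approximate.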

Another feature of Bayesian analysis is that it lowers the risk of improper
solutions resulting from small samples, by choosing a prior distribution that
assigns zero probability to such improper solutions (Martin & McDonald, 1975).



2.2 APIM Structural Model 

Buyer-seller interaction has been widely analyzed and modeled in the context
of satisfaction and loyalty research in service encounters (Wong, 2004). The
Actor-Partner Interdependence Model (APIM) was used to investigate the
buyer-seller interaction. APIM models were developed primarily in developmental
psychology to explain close relationships in dyadic and network settings (Kenny,
Kashy, & Cook, 2006). Two main effects are identified: (a) an actor effect,
where one actor’s feature influences another feature of the same actor, and (b)
a partner effect, where a feature of one subject influences a feature of the
partner. In model comparisons, the equality of the actor and partner effect
parameters can be assumed and tested. Additionally, moderator variables may give
rise to four interaction effects: the partner-moderated partner effect, the
actor-moderated partner effect, the partner-moderated actor effect and the
actor-moderated actor effect (Cook & Kenny, 2005).

Fourteen structural models were developed and compared on the basis of the
posterior predictive p-value (PP p-value) and the Deviance Information Criterion
(DIC). The posterior predictive p-value is the Bayesian alternative to the
classical p-value for model evaluation. A posterior predictive p-value close to
0.5 indicates a correct model (Meng, 1994): the real data y are typical of data
generated by the model. The DIC is a generalization of AIC and is used for model
comparison in the same way as AIC and BCC. It is the sum of the posterior mean
of the deviance and the effective number of parameters; the model with the
smallest DIC is to be preferred. The following models were compared:

(1) an actor-moderated partner effect model with equal actor and partner
effects (PP p-value = 0.49, DIC = 426.07),
(2) an actor-moderated partner effect model with free actor and partner effects
(PP p-value = 0.50, DIC = 425.75),
(3) an actor-moderated actor effect model with equal actor and partner effects
(PP p-value = 0.50, DIC = 431.04),
(4) an actor-moderated actor effect model with free actor and partner effects
(PP p-value = 0.50, DIC = 434.67),
(5) a partner-moderated partner effect model with equal actor and partner
effects (PP p-value = 0.50, DIC = 434.45),
(6) a partner-moderated partner effect model with free actor and partner
effects (PP p-value = 0.50, DIC = 435.36),
(7) a partner-moderated actor effect model with free actor and partner effects
(PP p-value = 0.50, DIC = 441.51),
(8) a partner-moderated actor effect model with equal actor and partner effects
(PP p-value = 0.50, DIC = 439.18),
(9) an actor-moderated actor-partner effect model with free actor and partner
effects (PP p-value = 0.50, DIC = 420.87),
(10) an actor-moderated actor-partner effect model with equal actor and partner
effects (PP p-value = 0.49, DIC = 424.09),
(11) a partner-moderated actor-partner effect model with free actor and partner
effects (PP p-value = 0.51, DIC = 433.30),
(12) a partner-moderated actor-partner effect model with equal actor and
partner effects (PP p-value = 0.49, DIC = 432.54),
(13) an actor- and partner-moderated actor-partner effect model with free actor
and partner effects (PP p-value = 0.50, DIC = 430.31), and
(14) an actor- and partner-moderated actor-partner effect model with equal
actor and partner effects (PP p-value = 0.49, DIC = 427.78).

The PP p-values are more or less comparable and show a good fit of the proposed
models. The ninth model was chosen as the final one because it has the minimum
DIC, although the DIC values of the remaining models are not very different from
that of the selected model. The model with the smallest DIC has the highest
chance of predicting a replicate data set. The final model is shown in Fig. 1.
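
The selection rule is simply a minimum over the reported DIC values, where DIC is the posterior mean deviance plus the effective number of parameters. The short sketch below reruns that comparison on the fourteen (PP p-value, DIC) pairs quoted in the text; the dictionary layout and function name are illustrative only.

```python
# (PP p-value, DIC) for the fourteen models, as reported in the text.
MODELS = {
    1: (0.49, 426.07), 2: (0.50, 425.75), 3: (0.50, 431.04), 4: (0.50, 434.67),
    5: (0.50, 434.45), 6: (0.50, 435.36), 7: (0.50, 441.51), 8: (0.50, 439.18),
    9: (0.50, 420.87), 10: (0.49, 424.09), 11: (0.51, 433.30), 12: (0.49, 432.54),
    13: (0.50, 430.31), 14: (0.49, 427.78),
}

def rank_by_dic(models):
    """Model numbers sorted from smallest (preferred) to largest DIC."""
    return sorted(models, key=lambda m: models[m][1])

ranking = rank_by_dic(MODELS)
best = ranking[0]
# DIC differences to the best model; small gaps mean the choice is not clear-cut.
gaps = {m: round(MODELS[m][1] - MODELS[best][1], 2) for m in ranking[1:4]}
print(best, gaps)
```

Model 9 comes first, with models 10, 2 and 1 within about five DIC units of it, which matches the remark that the alternatives are not far from the selected model.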

In this model it is assumed that the seller’s empathy (SE) influences both
her/his own satisfaction with the relation (SS) and the customer’s satisfaction
(CS). Customer empathy (CE) likewise influences his/her own and the seller’s
satisfaction. The first relation is known as the actor (intrapersonal) effect
and the second as the partner (interpersonal) effect (Kenny, 1996). The length
of relation (LR) is a function of buyer and seller satisfaction with the
relationship. Additionally, two exogenous moderator variables are introduced,
namely client and seller involvement (CI and SI respectively).
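
The main-effects part of this description can be written as three structural equations. This is only a sketch: the symbols $a_s$, $a_c$, $p_s$ and $p_c$ follow the notation used for the estimates reported in the text, while $b_s$, $b_c$ and the disturbance terms $\zeta$ are names assumed here for illustration, and the moderating (interaction) terms involving CI and SI are left out.

```latex
\begin{aligned}
SS &= a_s\, SE + p_c\, CE + \zeta_{SS}, \\
CS &= a_c\, CE + p_s\, SE + \zeta_{CS}, \\
LR &= b_s\, SS + b_c\, CS + \zeta_{LR}.
\end{aligned}
```

Here $a_s$ and $a_c$ are the actor effects of the seller and the client, $p_s$ is the partner effect of seller empathy on client satisfaction, and $p_c$ is the partner effect of client empathy on seller satisfaction.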

Because of nonindependence between the buyer and the seller, error variances
and disturbances between constructs are correlated. The factor loadings for the
latent variables have been constrained to be equal for both partners, reflecting
the fact that the constructs have the same meaning for both sides of the
relationship. Equality of latent variable variances and error variances
indicates that buyers differ from one another to the same degree that sellers
differ from one another (Kenny et al., 2006).

Fig. 1 APIM model of service encounter

The basic APIM model exhibits good fit with posterior predictive (PP) 
p-value = 0.49 and DIC = 420.87. Figure 2 shows the frequency polygons of mar- 
ginal posterior distribution of the parameters in structural part of the model along 
with trace plots. 

Because of lack of sufficient knowledge, diffuse uninformative priors are used in 
the estimation process. 

Trace plots reveal the stability of the MCMC procedure: there is regular, rapid
up-and-down variation with no trends or drifts, which suggests that the sampled
value at any iteration is unrelated to the sampled value k iterations later.
Convergence of the MCMC algorithm was obtained with a convergence statistic
below 1.0002 (Gelman, Carlin, Stern, & Rubin, 2004), where 1.0000 represents
perfect convergence.
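
A convergence statistic of this kind compares the variation between several chains with the variation within them. The sketch below implements the basic potential scale reduction factor in that spirit, assuming several equally long chains for one parameter; the software used by the authors may compute its statistic differently, and the simulated chains are purely illustrative.

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor for equally long chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])   # W
    between = n * statistics.variance(chain_means)                        # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

rng = random.Random(0)
# Three chains sampling the same distribution: the statistic should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains stuck around different values: the statistic is well above 1.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 3.0, 6.0)]

print(round(gelman_rubin(mixed), 4), round(gelman_rubin(stuck), 2))
```

Values very close to 1 indicate that the chains agree; values clearly above 1 indicate that the chains have not yet settled on a common distribution.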

The measurement part of the model shows that factor loadings for the empathy
construct are lower than loadings for satisfaction (one indicator was removed
due to a negative loading). In the structural part, seller effects and client
partner effects are significant at the 95% (**) or 50% (*) credible interval. A
credible interval is the Bayesian highest density region (HDR), which contains
the true value of the unknown parameter with probability $1 - \alpha$.
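
Given posterior draws of a parameter, such an interval can be approximated by the shortest window that holds the required share of the draws. The following is a minimal sketch with hypothetical function names; it assumes a unimodal posterior (for a multimodal one the highest density region need not be a single interval).

```python
import math

def hdr_interval(samples, alpha):
    """Shortest interval containing a fraction 1 - alpha of the sampled values."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, math.ceil((1.0 - alpha) * n))   # number of draws inside the interval
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]

def excludes_zero(interval):
    """True if the whole interval lies on one side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0

# A small skewed sample: the 50% region ignores the outlying value.
print(hdr_interval([0, 1, 2, 3, 4, 5, 6, 7, 8, 100], alpha=0.5))
print(excludes_zero((0.12, 0.68)), excludes_zero((-0.22, 0.31)))
```

An effect is then flagged at a given level when its interval lies entirely on one side of zero, as in the two example intervals passed to `excludes_zero`.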

A relatively higher actor effect has been observed for the seller: the more
seller empathy, the more seller satisfaction ($a_s = 0.39$). Client empathy has
no influence on his/her satisfaction ($a_c = 0.06$).

Both partner effects are positive and the partner effect for the sellers is also
relatively stronger ($p_s = 0.46$) than for the clients ($p_c = 0.12$). It is
clear that seller empathy has a strong and positive impact on client
satisfaction with the relation. The length of relation depends moderately and
more or less equally on both buyer and seller satisfaction (0.17 and 0.19
respectively). The involvement of the client and that of the seller are
negatively correlated (-0.17), which shows the negative adjustment of the
role-taking process (the more the client is involved, the less the seller is).
Table 2 shows the parameters of main and interaction effects with 95% credible
intervals.

Fig. 2 Posterior distributions and trace lines of regression weights

Only the actor and the partner effects for the seller are relatively strong and
significant. The partner effect even dominates the actor effect for the seller,
so the empathy of the seller influences client satisfaction with the relation
more than his/her own satisfaction. By contrast, client empathy has only a
slight influence on seller satisfaction. This stresses the dominant role of the
seller in the service encounter. Involvement plays a minor role in shaping the
empathy-satisfaction relation. Surprisingly, seller involvement is significantly
and negatively related to seller empathy. The interaction terms show that only
seller involvement moderates the client partner and client actor effects, but
not the seller partner effect. This reveals the asymmetric nature of the dyadic
interaction. The more the seller is involved, the higher the mutual satisfaction
with the relation and the length of relation are. On the other hand, the
involvement of the client has no direct or moderating effect on partner
satisfaction.

Table 2 Structural effects of APIM model

Structural path          Estimate   Lower 95%   Upper 95%
Actor effects:
  SE-SS**                  0.39       0.12        0.68
  CE-CS                    0.06      -0.22        0.31
  SI-SE**                 -0.32      -0.59       -0.05
  CI-CE                    0.07      -0.21        0.39
Partner effects:
  SE-CS**                  0.46       0.23        0.78
  CE-SS                    0.12      -0.14        0.37
  SI-CE                   -0.01      -0.29        0.26
  CI-SE                   -0.10      -0.39        0.18
  CI-SI                   -0.17      -0.35       -0.04
Joint effects:
  SI-LR                    0.19       0.09        0.46
  CI-LR                    0.17       0.04        0.48
Interaction effects:
  SI*SE-SS                 0.06      -0.23        0.35
  CI*CE-CS                -0.15      -0.50        0.20
  SI*SE-CS                -0.04      -0.25        0.32
  CI*CE-SS                 0.02      -0.32        0.37
  SI*CE-SS**               0.19       0.08        0.47
  CI*SE-CS                 0.06      -0.23        0.36
  SI*CE-CS**               0.21       0.04        0.48
  CI*SE-SS                -0.03      -0.35        0.27

** Effect significant at the 95% credible interval

Finally, basic APIM models (excluding interaction terms) were estimated
separately for the service providers Era, Orange and Plus GSM. Table 3 shows the
structural parameters of the Bayesian estimation. Unfortunately, because of the
extremely small sample sizes ($N_{Era} = 14$, $N_{Orange} = 28$ and
$N_{Plus} = 13$), one cannot generalize the findings and the results may be
biased. The estimated parameters are presented with 50% credible intervals. Fit
measures are also unsatisfactory: posterior predictive p-value = 0.05 and
DIC = 139.62. Nevertheless, the results may serve as hints for formulating
hypotheses from a managerial point of view.

Table 3 shows that actor and partner effects differ across service providers.
The seller actor and seller partner effects are rather similar for Era and Plus
GSM, although the client actor effect is much stronger for Plus GSM. The
influence of satisfaction on the length of relation is different for each
provider. Client satisfaction plays the dominant role in the case of Era, client
and seller satisfaction are almost equally important for Orange, and only a
small and insignificant relation is observed for Plus GSM.

Table 3 Structural path estimates for the providers

Structural path      Era      Orange    Plus GSM
Actor effects:
  SE-SS              0.67*    -0.09      0.67*
  CE-CS              0.07     -0.07      0.33*
Partner effects:
  SE-CS              0.69*     0.20*     0.66*
  CE-SS              0.08      0.16*     0.16*
  SE-CE              0.60*     0.29*     0.76*
Joint effects:
  SS-LR              0.00      0.29*     0.11
  CS-LR              0.40*     0.22*     0.08

* Effect significant at the 50% credible interval



3 Final Remarks 

In this paper Bayesian structural equation modeling was applied to explain the
buyer-seller interdependencies during the service encounter. The empathy and
satisfaction constructs were inferred from non-obtrusive observational data. The
APIM model, which is popular in developmental psychology for the measurement of
close relationships, confirmed its usefulness in dyadic relationship marketing
analysis. This promising approach may overcome the limitation of using
individual-trait data to draw conclusions about relational properties of the
actors in service encounters.
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Part IX 

Archaeology and Spatial Planning 



Estimating the Number of Buildings in Germany 



M. Behnisch and A. Ultsch 



Abstract The debate on sustainable development has led to the view of buildings
as flows (mass, energy, money and information) or capitals. In this context buildings
are considered the largest physical, economical, social and cultural capital of a
society. In Germany many institutions record different kinds of data about buildings.
Unfortunately there are just a few basic statistics about the number of buildings.
Collection of data is very complicated and often expensive, and the handling of
missing data is one of the biggest handicaps. With the exception of data about
residential buildings and particularly monuments, determining the total number of
buildings is an unsolved problem. Thus the main issue of this article is the
description of an appropriate estimation procedure. This procedure relies on 12,430
communes and refers to data from the Cadaster of Real Estates and the Federal Office
for Building and Regional Planning (BBR). The estimation is based on statistical data
from well-known and easily accessible institutions. The number of buildings is
estimated for communes with missing data. Using methods from the so-called Urban
Data Mining approach, unsuspected relationships are found in the urban data. These
relationships are valuable for the estimation. The quality of the estimation is
analyzed by training and test data sets. Information optimization leads to the
conclusion that 20% of the communes hold 80% of all buildings. For an improvement of
the estimation it is essential to refine the amount and quality of data in the larger
communes.

Keywords Building stock · Data mining · Knowledge discovery · Spatial planning.



1 Introduction 



Urbanized areas are a major component of the modern environment. For the first
time, more than half of the world's population will be living in urbanized areas by
the end of this decade (United Nations, 2008). It is widely acknowledged that
the contemporary physical space presents a complex structure; research on the
nature of this structure and the pattern of its growth has remained indispensable.
Buildings are the largest physical, economical, social and cultural capital of a
society (Kohler & Hassler, 2002). A comprehensive understanding of buildings is
difficult to achieve due to their structural (i.e., use, costs, materials) and dynamic
complexity (i.e., construction, maintenance, refurbishment, demolition). Recently,
improving the energy performance of buildings has become a challenge. Thus knowledge
about the mass of buildings and its spatial distribution will support planning
processes for construction and especially refurbishment.

Determining the total number of buildings remains an unsolved problem in most
countries of the world, and in Germany in particular (Hofman, 2001). The aim of this
article is to present an appropriate estimation procedure for buildings on the
administrative level of communes.

Previous approaches in Germany concentrated on the national administrative
level (Spillner, Russig, Dullinger, von Roncador, & Schunk, 1999). All of them
have failed to determine the total number of buildings in Germany. The main reason
is the lack of appropriate official statistics dealing with all types of buildings.
Usually the housing part is well known because of the particular political interest
in social housing. The non-housing part, amounting to approximately 50% of all
buildings, is only partially covered in the form of annual construction reports.
Some simplified approaches therefore started to cumulate the annual statistical data
(Hassler, Kohler, & Paschen, 1999). Such approaches lacked both the former total
number of buildings as a reference point and the annual number of demolished
buildings. Threshold approaches lead to rough estimations of the total floor area and
are based on uncertain assumptions (e.g., mean floor area in relation to specific
user statistics). Other approaches focus on the estimation of just one use-class
of buildings, e.g., industrial buildings, in combination with extrapolating indicators
(Hassler & Kohler, 2004). All these approaches risk failing for lack of reliable raw
data and of reasonable assumptions about dedicated extrapolating indicators. The task
is to find techniques to search for patterns in raw data that can help to explain the
underlying structure that generated the data (Behnisch, 2008).



2 Inspection and Transformation of Data 

The estimation procedure relies on 12,430 communes in 2004 and refers to data
from the Cadaster of Real Estates and the Federal Office for Building and Regional
Planning (BBR). In recent years the German Cadaster of Real Estates established
digital and comparable building data. The raw data of buildings is primarily based on
the automatic real estate map (ALK). It was collected for about 6,560 communes. In
parts it was possible to verify and extract information about the size, shape and
condition of the buildings by interpreting aerial and satellite images. Furthermore
130 variables are set up to find unexpected relationships to the total number of
buildings. By the use of statistical data from well-known and easily accessible
institutions it might be possible to estimate the number of buildings for communes
with missing data. Table 1 gives an overview of data dimensions and the number of
variables.

Table 1 Overview of data dimensions and amount of variables

Dimension              Source of collected data                        Number of variables
Area                   Statistics of land-use                          26
People                 Statistics of population and employment         55
Mobility               Statistics of commuters                         14
Finances               Statistics of public finance (tax)              1
Reachability           Reachability model of the BBR                   1
Housing                Adjustment of housing statistics                26
Spatial configuration  Geo-computation (e.g., degree of compactness)   2

Fig. 1 Scatter plot and density scatter plot of population and agricultural area
(panel labels: cdf (population), correlation value 95.1378%; cdf (agricultural area),
correlation value 26.6562%)

Fig. 2 Q-Q-plot (log(sum of buildings)) and PDE (log(sum of buildings))

The task is to identify correlations between the target variable (sum of buildings)
and all other established variables (see Table 1). Pearson's correlation reflects the
degree of linear relationship between two variables. Scatter plots and especially
density scatter plots supplement the investigation to discover non-linear
correlations. Density scatter plots measure the density in data space more directly.
Darker colors correspond to larger densities. CDF-transformation is used to take into
account restrictions of multivariate statistics. The cumulative distribution function
(CDF) describes, for each value, the proportion of observations in the data set that
do not exceed it. For the purpose of comparison, data is normalised to the interval
0-1. Figure 1 shows an example of a linear and a non-linear correlation as a result
of the identification process.
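The screening step described above can be illustrated with a minimal sketch. It assumes that the CDF-transformation is the empirical CDF (each value is replaced by the share of observations not exceeding it), which the text does not spell out; the function names and the toy values are hypothetical and not taken from the study.

```python
from bisect import bisect_right
from math import sqrt


def ecdf_transform(values):
    """Replace each value by its empirical CDF value, i.e. the share of
    observations that are less than or equal to it (result lies in (0, 1])."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy values (assumption), one entry per commune.
population = [500, 1200, 3000, 8000, 25000, 90000]
buildings = [180, 400, 950, 2300, 6000, 17000]

# Correlation of the CDF-normalised variables, both scaled to the interval 0-1.
r_cdf = pearson(ecdf_transform(population), ecdf_transform(buildings))
```

Because the empirical CDF only keeps the rank information, any strictly monotone relationship yields a correlation of 1 after the transformation, which is why the scatter plots remain necessary to judge the actual shape of a relationship.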

The detected relationships are valuable for the estimation approach. First a
detailed understanding of each variable and its distribution is necessary. Thus the
inspection of data includes visualisation in the form of histograms, Q-Q-plots,
PDE-plots (Ultsch, 2003) and box-plots. Figure 2 shows the distribution of the target
variable. It approximately follows a log-normal distribution.



3 Estimation 

A detected log-linear relationship between sum of buildings and population is
of great value for a single regression approach. Several other detected correlations
might be of interest for a more complex, multidimensional estimation approach if the
single regression approach turns out not to be accurate enough. The term unsupervised
data refers to communes with an unknown number of buildings. Supervised data
characterises communes with a well-known number of inhabitants and buildings. Such
data exists for about 53% of all 12,430 communes in Germany. For this reason the
quality of a single regression approach is analysed by randomly generated training
and test data sets drawn from the supervised data. One important result of such a
quality control using the corresponding test data is that communes with many
inhabitants are relevant for improving the estimation. In particular the deviation
between training and test data increases with the size of the population (see Fig. 3,
left chart). Beyond that, it is possible to assess the estimation error for each
training and test data set. In this case it is below 4% for more than 95% of the
analysed training data set (see Fig. 3, right chart). By comparing and analysing all
intermediate results of randomly generated training and test data it is possible to
specify the general estimation error (see Fig. 4). Such a generalisation comprises
the calculation of the mean general estimation error and its standard deviation. The
single regression approach leads to reliable results for the supervised data set. The
general estimation error is below 5% for about 95% of data. The range of standard
deviation is 2%.

Fig. 3 Quality control of a randomly generated training and test data set

Fig. 4 General error based on 20 randomly generated training and test data sets

Fig. 5 Linear regression on log(sum of buildings) and log(population)

Finally the regression is applied to the whole data set to handle the estimation
for communes with an unknown number of buildings. Figure 5 marks the location of
ordered pairs (X,Y) and determines the equation of the regression line. It should be
highlighted that the presented estimation approach allows the total number of
buildings to be determined for the first time. Accordingly, there were approximately
38 million buildings in Germany in 2004. The mass of buildings is determined by the
exact number of buildings in 6,550 communes plus the estimated number of buildings in
5,870 communes. For the latter the regression equation is fundamental.
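The single regression approach and its quality control can be sketched as follows. This is a minimal illustration under stated assumptions: an ordinary least-squares line is fitted to log(sum of buildings) against log(population), the estimation error is taken to be the relative deviation on the test set (the text does not define it precisely), and the data are synthetic and noise-free. All names and coefficients are hypothetical.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def fit_log_log(population, buildings):
    """Fit log(buildings) = intercept + slope * log(population)."""
    return fit_line([math.log(p) for p in population],
                    [math.log(b) for b in buildings])


def predict_buildings(pop, slope, intercept):
    """Back-transform the log-log prediction to a number of buildings."""
    return math.exp(intercept + slope * math.log(pop))


# Synthetic supervised data (assumption): an exact power law without noise.
random.seed(0)
pops = [random.randint(200, 200000) for _ in range(200)]
supervised = [(p, 0.5 * p ** 0.9) for p in pops]

# Random split into training and test data, as in the quality control.
random.shuffle(supervised)
train, test = supervised[:150], supervised[150:]

slope, intercept = fit_log_log([p for p, _ in train], [b for _, b in train])

# Relative estimation error on the held-out communes.
rel_errors = [abs(predict_buildings(p, slope, intercept) - b) / b
              for p, b in test]
```

Repeating the split several times and averaging the resulting errors would correspond to the general estimation error based on 20 random data sets; with real data the errors would of course not vanish as they do for this noise-free example.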

Figure 6 shows the classified and localised estimation results (unsupervised data
set) as well as the communes with a well-known number of buildings (supervised
data set). It should be underlined that this estimation procedure makes it possible
to describe the regional distribution of buildings in detail (on the level of
communes). Data inspection of the results (e.g., PDF or Gaussian Mixture Model)
makes it possible to summarize subsets of communes and sharpens the structure in a
localised and meaningful way. Thus the number of buildings is divided into four
labeled classes (village, small town, town, city). Urban agglomerations such as the
Ruhr Area, Berlin or Stuttgart hold, as expected, many of the buildings in Germany.
It is possible to identify an urban and rural divide.

Fig. 6 Localisation of buildings in Germany



4 Information Optimisation 

Buildings are unevenly distributed in Germany. There are few communes with both
many inhabitants and many buildings (big cities, agglomerations) and there are many
communes with relatively few inhabitants and buildings. This is shown by a Lorenz
curve of the communes with a known number of buildings (see Fig. 7). The authors
used the same idea as for a theoretical foundation of the Pareto 80/20-law (Ultsch,
2001) to determine an optimal number of communes to be investigated in detail. From
the ideal point (0% of communes and 100% of knowledge of the total number of
buildings) the distance to the real situations on the Lorenz curve is measured. The
identification mark "a" in Fig. 7 shows the shortest of these distances. From this we
conclude that in order to gain more precision in the estimation of Germany's total
number of buildings only about 20% of the communes, the 20% largest ones, should be
measured in great detail. This is consistent with the well-known Pareto 80/20-law
(Ultsch, 2001).

Fig. 7 Distribution of buildings in German communes (horizontal axis: number of
communes [%])
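The distance criterion described above can be sketched in a few lines. The sketch assumes that communes are ordered from largest to smallest, that both axes are expressed as shares between 0 and 1, and that the Euclidean distance to the ideal point (0, 1) is used; the function name and the toy counts are hypothetical.

```python
from math import hypot


def pareto_cutoff(building_counts):
    """Return (share of communes, share of buildings) of the point on the
    Lorenz-type curve that is closest to the ideal point (0, 1)."""
    ordered = sorted(building_counts, reverse=True)  # largest communes first
    total = sum(ordered)
    n = len(ordered)
    best = None
    cumulative = 0
    for i, count in enumerate(ordered, start=1):
        cumulative += count
        x = i / n                 # share of communes investigated so far
        y = cumulative / total    # share of buildings covered so far
        distance = hypot(x, 1.0 - y)
        if best is None or distance < best[0]:
            best = (distance, x, y)
    return best[1], best[2]


# Hypothetical toy data (assumption): two large and eight small communes.
counts = [400, 400] + [25] * 8
share_communes, share_buildings = pareto_cutoff(counts)
```

For these toy counts the closest point is reached after the two largest communes, i.e. 20% of the communes holding 80% of the buildings.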



5 Conclusion 

By finding a log-linear correlation between the number of buildings and the number
of inhabitants (population) it was possible to estimate the number of buildings in
Germany for the first time. The authors estimate that 38 million buildings were
located in Germany in 2004. Furthermore it was possible to estimate the number of
buildings on the administrative level of communes. For an improvement of the
estimation it is essential to refine the amount and quality of data in the larger
(highly populated) communes. Another important result was discovered by information
optimisation: 20% of the German communes hold 80% of the total number of German
buildings.

Comparisons of cities and typological grouping processes are instruments for
a deeper investigation of this estimation result. The application of symbolic
algorithms such as the U*C-algorithm (Ultsch, 2006) leads to a better understanding
of the presented spatial structure.

In the future it might be good to optimise the distribution of communes before 
starting an estimation. At the moment it is not possible to start with a randomly dis- 
tributed raw data set because of the data accessibility (see Fig. 6: supervised data, 
i.e., a well-known number of buildings and inhabitants). Furthermore it might be 
of interest to compare the presented accuracy of the single regression approach 
(already below 5%) with a multidimensional approach using the other detected 
correlations (see Fig. 1). 

All in all it should be emphasized that procedures based on knowledge discovery
should be prepared for an operational integration into the regional and urban
planning process (Streich, 2005).

Mapping Findspots of Roman Military
Brickstamps in Mogontiacum (Mainz)
and Archaeometrical Analysis



Jens Dolata, Hans-Joachim Mucha, and Hans-Georg Bartel 



Abstract Mainz was a Roman settlement that was established as an important
military outpost in 13 BC. Almost 100 years later Mainz, the ancient Mogontiacum,
became the seat of the administrative centre of the Roman Province of Germania
Superior. About 3,500 brickstamps dating to the period until the fall of the
Roman Empire in the fifth century AD have been found in archaeological excavations.
These documents have to be investigated with several methods for a better
understanding of the history. Here the focus is on an application of spatial
statistical analysis in archaeology. Concretely, about 250 sites have to be
investigated. We compare maps of different periods graphically by nonparametric
density estimation. Here different weights of the sites according to the radius of
the finding area are taken into account. Moreover we can test whether an
archaeological segmentation is statistically significant or not. The combination of
smooth mapping, testing and looking for dated brickstamps offers a good chance to
obtain new sources for the Roman history of Mainz.

Keywords Archaeometry · Correspondence analysis · Mapping · Stamped Roman
bricks · Weighted nonparametric density.



1 Introduction 

A total of 1,775 Roman military brickstamps dating to the first century AD have
been found in archaeological excavations in Mainz, the ancient Mogontiacum. In
cataloguing them for a paper on Roman military archaeology, the stamps were
classified and new types of stamps were defined. All in all, 238 findspots of bricks
and tiles of the first century have been investigated. Additionally, the findspots
are described by survey coordinates. The mapping of the brickstamps visualizes the
size of the ancient city and gives details for the localization of military camps and
of civil settlement. Figure 1 shows an example of the usual mapping of a zoomed area
of Mainz (city) with locations of findspots. The river Rhine is located at the upper
right corner.

Fig. 1 Locations of findspots in Mainz (area 2,300 × 2,100 m²)

Dating the brickstamps by epigraphical investigation or by assigning them to a 
military brickyard based on geochemical analysis (Dolata, Mucha, & Bartel, 2003) 
allows the mapping of different periods. Two main maps have been plotted: (a) 
The earliest brickstamps found in Mainz are from the period of Emperors Claudius 
and Nero (41-68 AD, n = 932). They were manufactured by soldiers of legiones 
XXII Primigenia and IIII Macedonica. (b) Brickstamps of legiones I Adiutrix, XIV 
Gemina, VII Gemina, and XXI Rapax belonging to the Flavian period (69-96 AD, 
n = 843). 

These two main maps can be compared with some maps showing a selection
of brickstamps from the third and fourth centuries (Emperors Caracalla, Constantine
I or Julian, and Valentinian I, n = 102). Thus the maps show a total of 1,877
brickstamps from 246 sites. These maps can be compared with each other to
discover differences. In this paper we try to improve the evaluation of all
these maps for the history of settlement and urban development. Using statistical
methods we compare the different entries of the maps. Herein every single findspot
has a specific weight proportional to the corresponding number of findings.



2 Mapping of the Locations of Findspots 

Dolata (2000) has collected and referenced about 1,900 Roman bricks and tiles from
246 sites in Mainz until now. This paper gives a first view of a work in progress:
the spatial and statistical analysis of findspots of Roman military brickstamps in
Mainz. The findings come from the first four centuries AD. The database including
geographical information is itself still under development: many hundreds of findings
are not yet identified by their coordinates of location.

In addition to Fig. 1, Fig. 2 gives another view of the same zoomed area of
Mainz with locations of findspots. Here the locations of findspots are marked by
bubbles. The size of a bubble is proportional to the number of findings at the
location, which makes the plot much more informative.

Fig. 2 Locations of findspots in Mainz and their relative size



3 Smooth Mapping by Nonparametric Density Estimation

Usually, nonparametric density estimation is based on observations (sites) i that
all have the same weight w_i = 1. The estimate is a superposition of the
(small) hills around each site. An introduction to nonparametric density estimation
is given by Härdle (1990). Applications of density estimation can be found in
Mucha, Simon, and Brüggemann (2002), and specifically concerning archaeometry
in Baxter, Beardah, and Wright (1997). The latter comes with a range of examples.
The authors show that this methodology can be used as an informal approach to
spatial cluster analysis, and one example suggests that it is competitive with other
approaches in this area.

Here is a tiny example: univariate density estimation of the points 3, 3.4, 4, 8,
8.3, 8.5, 8.8, and 9 with an Epanechnikov kernel of bandwidth 1. Figure 3 presents
both the principle of superposition of elementary hills and the final univariate
density.
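The tiny example can be reproduced with a short sketch. It uses the standard Epanechnikov kernel K(u) = 0.75 (1 - u²) for |u| ≤ 1 and the usual kernel density estimate with equal weights; the evaluation grid from 0 to 12 is an assumption chosen for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, points, bandwidth=1.0):
    """Kernel density estimate at x: the average of the elementary hills
    centred at the observations, each scaled by the bandwidth."""
    hills = (epanechnikov((x - p) / bandwidth) for p in points)
    return sum(hills) / (len(points) * bandwidth)


points = [3, 3.4, 4, 8, 8.3, 8.5, 8.8, 9]

# Evaluate the density on a grid with step 0.01 (grid limits are an assumption).
step = 0.01
grid = [i * step for i in range(0, 1201)]
density = [kde(x, points) for x in grid]
```

The two groups of points produce two separate humps: between them, for instance at x = 6, no elementary hill reaches and the estimated density is zero.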

Another archaeological application of bivariate nonparametric density estimation
is given by Mucha, Bartel, and Dolata (2005). From the archaeological point of view
the importance of the sites differs, and that should be considered in statistical
models. Therefore, we give different weights to the observations (sites) in
nonparametric density estimation. That is, the heights of the elementary hills around
each site differ in the superposition process. For example, from the archaeological
point of view, an appropriate weight is the number of findings at a site.

In the following we consider such weighted bivariate density estimation of all
locations. Concretely, they are weighted by the logarithm of the number of
findings plus 1. The "+1" prevents zero weights. The result is shown in Fig. 4. In
order to compare different maps throughout the paper we used both the same area
as presented in Fig. 1 and the same bandwidths: 350 for the x-coordinate (easting)
and 250 for the y-coordinate (northing or equatorial distance). In our opinion
this accounts for the higher spread in the x-direction. Bandwidth selection by
cross-validation recommends much smaller bandwidths, resulting in non-smooth
surfaces that are unsuitable for comparisons. Moreover, the x-y base area is
congruent for all maps. Technically this can be guaranteed easily by using special
weights for the extreme points that are located in the corners. Concretely these
points have either the usual weights (if they are in the selection of findings of
the corresponding time period) or they get very small weights otherwise (and hence
they have no effect on the density estimation).

Fig. 3 The principle of density estimation in a graph

Fig. 4 Density surface of all findspots with regard to their size
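A weighted bivariate density estimate of this kind might be sketched as follows. The weights log(number of findings + 1) and the bandwidths 350 and 250 are taken from the text; the product Epanechnikov kernel and the normalisation by the sum of the weights are assumptions, since the kernel used for the maps is not stated here. The site coordinates are hypothetical.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def weighted_density(x, y, sites, hx=350.0, hy=250.0):
    """Weighted bivariate kernel density estimate at the location (x, y).

    sites: list of (easting, northing, number_of_findings).
    Each site contributes a hill whose height is proportional to
    log(number_of_findings + 1); a product kernel is assumed.
    """
    weights = [math.log(n + 1) for _, _, n in sites]
    total_weight = sum(weights)
    hills = sum(
        w * epanechnikov((x - sx) / hx) * epanechnikov((y - sy) / hy)
        for w, (sx, sy, _) in zip(weights, sites)
    )
    return hills / (total_weight * hx * hy)


# Hypothetical sites (assumption): easting, northing, number of findings.
sites = [(500.0, 400.0, 1), (520.0, 430.0, 40), (1800.0, 1500.0, 3)]

dense_spot = weighted_density(520.0, 430.0, sites)
sparse_spot = weighted_density(1800.0, 1500.0, sites)
empty_spot = weighted_density(1200.0, 1000.0, sites)
```

Dividing by the sum of the weights keeps the estimate a proper density, so that maps of periods with very different numbers of findings remain comparable in shape.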



4 Comparison of Different Periods 

First we show the bivariate density estimation of the locations of the period of 
Emperors Claudius and Nero (41-68 AD), see Fig. 5. The density is dominated by 
two dense regions that are located far from the river Rhine. 

Figure 6 shows the bivariate density estimation of the locations that belong to 
the Flavian period (69-96 AD). Now another dense region occurs: the area of civil 
settlement down at the river Rhine. 

In Fig. 7, the bivariate density estimation of the locations that belong to the
brickstamps from the third and fourth centuries (i.e., Emperors Caracalla,
Constantine I or Julian, and Valentinian I) is presented. Here the area of
settlement is expanded without any outstanding dense regions.

Fig. 5 Density surface of findspots from 41-68 AD

Fig. 6 Density surface of findspots from 69-96 AD

Fig. 7 Density surface of findspots from third and fourth centuries

Fig. 8 Cuts of the density surface of Fig. 6 at several levels

Fig. 9 Archaeological dissection of the area into the three main regions “Area 1”,
“Area 2”, and “Area 3”



Table 1 Contingency table of 1,334 findings which are divided into three different time periods
and three spatial regions

Period                       Area 1   Area 2   Area 3   Sum
Claudius and Nero            219      205      235      659
Flavian period               304      160      130      594
Third and fourth centuries   3        17       61       81
Sum                          526      382      426      1,334



Another output of the nonparametric density estimation is presented in Fig. 8. 
This graph shows several cuts of the density surface of the findspots that belong to 
the first part of the Flavian period. Usually one can try a segmentation of regions by 
density clustering. But keep in mind that there are many areas of Mainz without any 
action of excavation (gaps). Without any doubt the graph shows three dense regions. 

Figure 9 shows a coarse archaeological segmentation that can be supported by
nonparametric density estimation to a high degree. Additionally, the identification
number of each site is presented here.

Table 1 is a frequency table that crosses the three main time periods in the rows
with three regions (classes of the archaeological segmentation) in the columns. The
hypothesis of independence is rejected by the chi-square test: the chi-square
statistic (degrees of freedom = 4) is T = 126.46 and is highly significant. That
means that there are dependencies between areas and periods. Figure 10 shows the
graphical output of the correspondence analysis of Table 1.

Fig. 10 Correspondence analysis plot of the rows and columns of Table 1. The first
axis (abscissa) accounts for 96% of the total statistic
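The reported chi-square statistic can be recomputed directly from the counts of Table 1. The sketch below uses the standard Pearson chi-square statistic for a contingency table; the function name is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Counts of Table 1: rows are periods, columns are Area 1, Area 2, Area 3.
observed = [
    [219, 205, 235],  # Claudius and Nero
    [304, 160, 130],  # Flavian period
    [3, 17, 61],      # Third and fourth centuries
]

statistic, dof = chi_square(observed)
```

The statistic evaluates to about 126.46 with 4 degrees of freedom, in agreement with the value given in the text. The largest single contributions come from the third and fourth centuries, which are strongly over-represented in Area 3 and under-represented in Area 1.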



5 Conclusions 

The combination of mapping and looking for dated brickstamps offers a good chance
to obtain new sources for the Roman history of Mainz. The continuous density
surface based on weighted density estimation allows an easy comparison of periods
by eye. From the statistical point of view the gaps in the spatial distribution make
it difficult to use standard methods such as clustering for the segmentation of
regions. The database has to be completed by many hundreds of findings that are not
yet identified by their coordinates of location.

For more details of recent research into Roman bricks and tiles see
http://www.ziegelforschung.de.

Analysis of Guarantor and Warrantee
Relationships Among Government Officials 
in the Eighth Century in the Old Capital 
of Japan by Using Asymmetric 
Multidimensional Scaling 



Akinori Okada and Towao Sakaehara 



Abstract The relationships among government officials working in the old capital
of Japan in the eighth century were analyzed by using asymmetric multidimensional
scaling. The data consist of the relationships of giving a guarantee (guarantor)
and being given a guarantee (warrantee) among 26 government officials in the years
772, 773, and 774. The guarantor and warrantee relationships are inevitably
asymmetric, and this is the reason why asymmetric multidimensional scaling was
employed in the present analysis. The obtained result shows that while the
asymmetry in the guarantor and warrantee relationships is not large, the result,
along with the result of asymmetric cluster analysis, shows who the dominant persons
are from the standpoint of being guarantors for colleagues. The result also suggests
some information on the occupation of the government officials whose occupations are
unknown.

Keywords Asymmetry · Guarantor and warrantee relationships · Historical data ·
Multidimensional scaling.



1 Introduction 



This study deals with the relationships among lower ranked government officials
(all of them male) in the eighth century in the old capital of Japan. Some
government officials borrowed money, and they asked one or more colleagues to be the
guarantor for them (the warrantee). When one asked another to be the guarantor for
him, this shows that the former is less dominant than the latter. Who is the
guarantor for whom among the persons of a group represents the relative dominance in
the guarantor and warrantee relationships. The amount of money guaranteed shows the
dominance of the guarantor in the guarantor and warrantee relationship. The
relationship from the warrantee to the guarantor is not the same as that from the
guarantor to the warrantee. The guarantor and warrantee relationships are
asymmetric, and asymmetric multidimensional scaling was employed to analyze the
present data.

Asymmetric multidimensional scaling has been used to analyze various sorts
of asymmetric relationships, not only those among persons (Okada & Imaizumi,
2000) but also among brands (Borg & Groenen, 2005; Harshman, Green, Wind,
& Lundy, 1982; Okada, 1988), regions or nations (Okada & Imaizumi, 2003;
Zielman & Heiser, 1993), occupations (Okada & Imaizumi, 1997), word associations
(Harshman et al., 1982; Zielman & Heiser, 1996), and journal citations (Weeks &
Bender, 1982). The present study deals with the asymmetric relationships among
government officials more than 1,200 years ago. The purpose of the present study
is to disclose the relationships formed by the guarantor and the warrantee among
government officials working at the government office in the eighth century, to find
groups of government officials formed by the guarantor and warrantee relationships,
and to find who the dominant persons are (those who gave guarantees more than the
others did) among them. This will help to disclose an aspect of the relationships
among low ranked government officials 1,200 years ago.



2 Data 

The data analyzed in the present study are based on a set of documents, written on
wooden plates, which show guarantor and warrantee relationships among government
officials. The documents have been kept in the government warehouse called
Shoso-in (Sakaehara, 1987). The government officials were working in the old
capital of Japan called Heijo-kyo (presently called Nara). They were engaged in
copying the Buddhist sutra there. The documents tell (a) the name of the borrower,
(b) the amount of money he borrowed, (c) the name of the guarantor for the borrower
(warrantee), and (d) the date of the borrowing in the years 772, 773, and 774 (the
last of these consists of 90 cases in 774 and 15 cases in 775, and will be denoted
774 hereafter). The officials lodged in the capital. While they were paid by the
government, they sometimes borrowed money from the government office or asked the
government office to pay their salary before their payday. They needed one or more
persons to act as guarantor for them in order to get loans from the government
office.

The documents recorded the guarantor and warrantee relationships among 97
government officials. In the present study, 26 government officials, who gave or
were given a guarantee more than five times from 772 through 774, were selected.
The reason for analyzing the relationships among these 26 government officials is
that when all 97 persons are dealt with, a large number of relationships between two
persons are null; neither person was a guarantor or a warrantee of the other. From
the documents, a table which shows the guarantor and warrantee relationships among
the 26 government officials was constructed. The (j, k) element of the table shows
the amount of money the government official corresponding to row j borrowed which
was guaranteed by the government official corresponding to column k. One table
was constructed for each of the years 772, 773, and 774; there are three 26 × 26
tables.
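The construction of the yearly tables can be sketched as follows. The record layout (borrower, amount, guarantor, year), the names and the amounts are hypothetical; treating each record as one borrower guaranteed by one guarantor is an assumption. Cases dated 775 are counted under 774, as described in the Data section.

```python
from collections import defaultdict


def build_tables(records, officials):
    """Build one square table per year.

    Element [j][k] of a table is the total amount borrowed by official j
    (row, the warrantee) that was guaranteed by official k (column).
    """
    index = {name: i for i, name in enumerate(officials)}
    n = len(officials)
    tables = defaultdict(lambda: [[0] * n for _ in range(n)])
    for borrower, amount, guarantor, year in records:
        if year == 775:
            year = 774  # the cases of 775 are denoted 774 in the study
        # Only relationships among the selected officials are kept.
        if borrower in index and guarantor in index:
            tables[year][index[borrower]][index[guarantor]] += amount
    return dict(tables)


# Hypothetical records (assumption): borrower, amount, guarantor, year.
officials = ["A", "B", "C"]
records = [
    ("A", 200, "B", 772),
    ("A", 100, "B", 772),
    ("B", 50, "C", 774),
    ("C", 70, "A", 775),
    ("A", 30, "Z", 773),  # guarantor not among the selected officials
]

tables = build_tables(records, officials)
```

Each resulting table is in general asymmetric: in the toy data official A is guaranteed by B in 772, while B is never guaranteed by A.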



3 The Method 

When the amount of the money borrowed or guaranteed is regarded as the proximity
or similarity from the warrantee to the guarantor, the (j, k) element of the
table represents the proximity from the government official corresponding to row
j to the government official corresponding to column k. Each table is asymmetric.
The three 26 × 26 tables are two-mode three-way ([government official] ×
[government official] × [year]) asymmetric proximities. Two-mode three-way
asymmetric multidimensional scaling was used to analyze the present data.

The model of two-mode three-way asymmetric multidimensional scaling used
in the present study is derived from the predecessor (Okada & Imaizumi, 1997). A
model with a constraint on the asymmetry weight (Okada & Imaizumi, 2000) was
used in the present study. The present model inherits the property of the
orientation of dimensions from the predecessor; the orientation of dimensions is
uniquely determined up to reflections and permutations, and the rotation of the
dimensions is not allowed.

The model of the two-mode three-way asymmetric multidimensional scaling
used in the present study consists of the common object configuration, the
symmetry weight, and the asymmetry weight (cf. Carroll & Chang, 1970). The
common object configuration represents the relationships among government
officials which are common to all three years. In the common object configuration
each government official is represented as a point and a circle (in a two-dimensional
space), a sphere (in a three-dimensional space), or a hypersphere (in a space of more
than three dimensions) centered at that point in a multidimensional Euclidean
space. In the present model, interpoint distances represent the symmetric guarantor
and warrantee (or proximity) relationships among government officials. The
radius represents the asymmetric guarantor and warrantee (or proximity)
relationships among government officials. A larger radius means that the
corresponding person guarantees less and is guaranteed more, and a smaller radius
means that the corresponding person guarantees more and is guaranteed less. This
suggests that a government official with a smaller radius is relatively more
dominant, and that a government official with a larger radius is relatively less
dominant, in the guarantor and warrantee relationships among them.
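
The way a radius can turn a symmetric distance into an asymmetric quantity may be
easier to see in a small sketch. It assumes the radius-distance form
m_jk = d_jk - r_j + r_k that is usually associated with this family of models; that
form is an assumption here, since it is not written out in the text above, and the
year-specific symmetry and asymmetry weights are left out. The coordinates and
radii are hypothetical.

```python
import math

def model_dissimilarity(points, radii, j, k):
    """Assumed radius-distance form: m_jk = d_jk - r_j + r_k (weights omitted)."""
    d_jk = math.dist(points[j], points[k])  # symmetric interpoint distance
    return d_jk - radii[j] + radii[k]

# Hypothetical two-dimensional configuration with two officials.
points = [(0.0, 0.0), (3.0, 4.0)]
radii = [2.0, 0.0]  # official 0 has the larger radius

m_01 = model_dissimilarity(points, radii, 0, 1)  # 5 - 2 + 0
m_10 = model_dissimilarity(points, radii, 1, 0)  # 5 - 0 + 2
print(m_01, m_10)
```

Under this assumed form, the official with the larger radius has the smaller model
dissimilarity (that is, the larger proximity) toward the other official, which matches
the reading that he is guaranteed more than he guarantees.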

Each year has a symmetry weight. The symmetry weight for a year represents
the salience of the symmetric relationships among government officials in that year.
Each year also has a set of asymmetry weights. The asymmetry weight represents
the salience along each dimension in the asymmetric relationships among
government officials in the year. The asymmetry weights are constrained so that the
ratio of any two asymmetry weights within the same year is constant over the three
years (see Fig. 2).

A nonmetric algorithm to obtain the common object configuration, the symmetry
weight, and the asymmetry weight from the observed two-mode three-way
asymmetric proximities was extended from the algorithm introduced by Okada and
Imaizumi (1997). The badness-of-fit measure of the model to the observed
proximities is called stress. The algorithm seeks the common object configuration,
the symmetry weight, and the asymmetry weight that minimize the stress in a
Euclidean space of a given dimensionality. The algorithm iteratively improves the
initial configuration and values by the steepest descent method.



4 The Analysis and the Result 

The analysis begins in the maximum-dimensional space and continues down to the
unidimensional space. In the present study, we used five, six, seven, eight, and nine
as the maximum dimensionality. After executing the analysis with each of these
maximum dimensionalities, one stress value was obtained in the nine-dimensional
space, two stress values in the eight-dimensional space, ..., and five stress values in
each of the five- and lower-dimensional spaces. In each of the five- through
unidimensional spaces, the smallest stress was chosen as the minimized stress value.
These values, from the uni- through five-dimensional spaces, were 0.332, 0.260,
0.215, 0.184, and 0.161. From these stress values and the interpretation of the
results, the four-dimensional configuration was chosen as the solution.
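
The bookkeeping behind these counts can be sketched as follows. The maximum
dimensionalities and the minimized stress values are taken from the text; the
variable names are illustrative, and the stress gain per added dimension is only
one possible aid to choosing the dimensionality, not necessarily the criterion
applied in the study.

```python
# Maximum dimensionalities of the five runs (from the text).
max_dims = [5, 6, 7, 8, 9]

# Each run yields one stress value for every dimensionality from its maximum
# down to 1, so dimensionality d receives one value per run with maximum >= d.
stress_counts = {d: sum(1 for m in max_dims if m >= d) for d in range(9, 0, -1)}

# Minimized stress values reported for dimensionalities 1 through 5 (from the text).
min_stress = {1: 0.332, 2: 0.260, 3: 0.215, 4: 0.184, 5: 0.161}

# Reduction in stress obtained by adding one more dimension.
gains = {d: round(min_stress[d - 1] - min_stress[d], 3) for d in range(2, 6)}

print(stress_counts)
print(gains)
```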

The four-dimensional common object configuration of the 26 government officials
is shown in Fig. 1. In Fig. 1 each government official is represented as a point and
a circle centered at that point in the two-dimensional space of Dimensions 1 and
2 (Fig. 1a) and in that of Dimensions 3 and 4 (Fig. 1b). As described earlier, the
orientation of dimensions is uniquely determined up to reflections and permutations.
Each number from 1 to 26 represents the location of a government official in the
two-dimensional space. An italicized number indicates that the location is
represented by a plus sign, not by the number itself. Sometimes a small line is used
to indicate the location of a person.

In Fig. 1 government official 21 has the largest radius, and 14, who is represented
by a small open diamond-shaped symbol, has the smallest radius, which is zero
by definition (Okada & Imaizumi, 1997, p. 210). To avoid a cluttered figure,
only the three largest radii (21, 24, and 7) and the three smallest radii (14, 12, and
19) are shown in Fig. 1. The radii of the other 20 government officials are omitted.
In the present data, the radius of a hypersphere represents asymmetric relationships
among government officials. The larger the radius of a government official is, the
smaller the amount for which he gives guarantees to the others and the larger the
amount for which he is given guarantees. In other words, a government official with
a larger radius is less dominant than one with a smaller radius. Government official
21 has the largest radius, suggesting that he is least dominant, and 14 has the
smallest radius of zero, suggesting that he is most dominant among the 26
government officials.

Fig. 1 Four-dimensional common object configuration of 26 government officials:
(a) two-dimensional configuration of Dimensions 1 and 2; (b) two-dimensional
configuration of Dimensions 3 and 4

Fig. 2 Asymmetry weight along dimensions for the years 772, 773, and 774:
(a) asymmetry weight along Dimensions 1 and 2; (b) asymmetry weight along
Dimensions 3 and 4

Each of the three years has a symmetry weight: 0.319, 0.321, and 0.319 for the
years 772, 773, and 774, respectively. The salience of the symmetric relationships
among government officials is slightly larger in 773 than in 772 and 774. Figure 2
shows the asymmetry weight along each dimension. Figure 2a shows the asymmetry
weight along Dimensions 1 and 2, and Fig. 2b shows the asymmetry weight along
Dimensions 3 and 4. The three points, each representing a year, in each
two-dimensional configuration lie on a line emanating from the origin, because the
ratio of any two asymmetry weights within a year is assumed to be constant over
the three years. Dimension 1 has the largest asymmetry weight, and the asymmetry
weight decreases from Dimension 1 through Dimension 4. This suggests that the
asymmetry in the relationships among government officials is largest along
Dimension 1 and decreases from Dimension 1 through Dimension 4. The asymmetry
weights for the year 773 are the smallest among the three years, which is compatible
with the symmetry weight mentioned above.



5 Discussion 

In the present study 26 government officials are dealt with. While the amount of
money each one borrowed or guaranteed is known, the characteristics of each person
are not known at all, except for the rank and the occupation (two categories) at the
government office. While there were two levels of rank among the 97 government
officials, the 26 government officials selected from them were all of the higher rank.
The higher rank includes two occupations. Government officials 5, 20, and 24 were
engaged in a different occupation from the others (the occupations of government
officials 10 and 22 are unknown). It is difficult to interpret the obtained
configuration from the characteristics of these 26 government officials.

The asymmetric cluster analysis (Okada, 2000; Okada & Iwamoto, 1996) was
applied separately to the data of each of the years 772, 773, and 774. This is because
the asymmetric cluster analysis can deal only with one-mode two-way proximities
and cannot deal with two-mode three-way proximities.

Figure 3 shows the dendrogram derived by analyzing the data of the year 773. In
Fig. 3 there are four clusters (from left to right, the clusters consist of five, three,
three, and two government officials). Because there are a lot of null cells in the
table, all four clusters and the remaining singletons (13 government officials) join
into one cluster when the amount of money borrowed or guaranteed is zero. This
final stage of the asymmetric cluster analysis is meaningless, which is why the four
clusters are left separate. In Fig. 3 the dominant government official (emboldened)
in each cluster, who absorbed the others, is disclosed. In the leftmost cluster,
government official 15 absorbed 5, 3, 20, and 24. Government official 15 is the
dominant person in this cluster.

Fig. 3 Asymmetric cluster analysis of the year 773 data (vertical axis: amount of
money borrowed or guaranteed, from 0 to 1,000; government officials from left to
right: 3, 20, 24, 5, 15, 2, 6, 8, 9, 12, 16, 4, 21)

From the dendrograms derived by analyzing the data of 772, 773, and 774, there
are six clusters in 772, four clusters in 773, and two clusters in 774, as shown
below. The emboldened number, listed first in each cluster, represents the
government official who is the dominant person in the cluster formed by himself and
the person(s) he absorbed.

Year 772

Cluster 7721: 23, 7, 9, 15, 16
Cluster 7722: 14, 4, 11, 12, 17, 18, 20, 21, 25, 26
Cluster 7723: 5, 1, 24
Cluster 7724: 10, 22
Cluster 7725: 19, 8
Cluster 7726: 26, 22

Year 773

Cluster 7731: 15, 3, 5, 20, 24
Cluster 7732: 8, 2, 6
Cluster 7733: 16, 9, 12
Cluster 7734: 4, 21

Year 774

Cluster 7741: 25, 4, 5, 6, 8, 13, 17, 18, 19, 21
Cluster 7742: 14, 3, 12

There are 12 clusters, and 12 dominant government officials are disclosed.
Government official 14 is dominant in the years 772 and 774. He seems to be the
most dominant in 772 (he absorbed nine colleagues), but only the second most
dominant in 774 (he absorbed two colleagues), suggesting that while he had been
losing dominance, he was still dominant in 774. Of the 12 dominant persons, eight
have a radius not larger than the mean radius. This seems to modestly validate that
the radius represents the dominance of a government official. For each year, the
difference of the column j sum (the total amount for which government official j
gave guarantees to the others) minus the row j sum (the total amount for which
government official j was given guarantees by the others) was calculated for each
government official (j = 1, ..., 26) in the table. The sum of the differences over the
three years was calculated for each government official. The correlation coefficient
of the radius and the sum of differences is 0.84. This figure also seems to validate
that the radius represents the dominance of a person in the guarantor and warrantee
relationships.
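
The column-minus-row difference described above can be sketched in a few lines.
The function name and the small 3 x 3 tables are hypothetical and serve only to
illustrate the computation; they are not the historical data.

```python
def dominance_differences(tables):
    """Sum over years of (column j sum - row j sum) for each official j.

    tables[t][j][k] is the amount official j borrowed under the guarantee of
    official k in year t, so a column sum is the total amount an official
    guaranteed and a row sum is the total amount guaranteed for him.
    """
    n = len(tables[0])
    totals = [0] * n
    for table in tables:
        for j in range(n):
            column_sum = sum(table[i][j] for i in range(n))
            row_sum = sum(table[j])
            totals[j] += column_sum - row_sum
    return totals

# Hypothetical tables for two years and three officials.
tables = [
    [[0, 10, 0], [0, 0, 0], [5, 0, 0]],
    [[0, 0, 0], [0, 0, 20], [0, 0, 0]],
]
differences = dominance_differences(tables)
print(differences)
```

In this toy example the third official guarantees more than is guaranteed for him,
so his summed difference is the largest.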

Some of the clusters are located nearly parallel to dimensions. The geometric
relationships between the four dimensions and the locations of the clusters in the
common object configuration were examined. Clusters 7723, 7733, and 7742 are
nearly parallel to Dimension 1. Clusters 7731 and 7732 are nearly parallel to
Dimension 2. Clusters 7722, 7733, and 7742 are nearly parallel to Dimension 3. The
asymmetry weight decreases from Dimension 1 through Dimension 4. The clusters
parallel to Dimension 1 have the largest asymmetry in the relationships among the
government officials within them, and the clusters parallel to Dimensions 2 and 3
have smaller asymmetry than those parallel to Dimension 1.

Cluster 7731 (the leftmost cluster in Fig. 3) consists of five government officials
(15, 3, 5, 20, and 24). As mentioned earlier, three of them (5, 20, and 24) were
engaged in a different occupation from the others, yet they are in the same cluster
as government officials 15 and 3. The two government officials (10 and 22) whose
occupations are unknown constitute Cluster 7724, and they do not belong to the
other clusters. They are close to each other and to government officials 5, 20, and
24 along Dimensions 1, 2, and 3, but they are distant from each other and from the
others along Dimension 4. This means that they are similar to each other and to the
three persons (5, 20, and 24) engaged in the different occupation, but differ from
each other and from the other persons on the feature represented by Dimension 4.
This seems to give some information on the occupations of government officials 10
and 22.

The information on lower ranked government officials like those dealt with in the 
present study is extremely limited. Thus, while it is difficult to interpret the results 
by using another kind of information on them, the present results can throw some 
light on the relationships among lower ranked government officials 1,200 years ago, 
which can be a starting point of further studies on the relationships. 

As final remarks, three aspects of the present data should be mentioned.
Firstly, the number of non-zero cells is not large in any table. Of the 650
non-diagonal cells in each table, 59 cells in the year 772, 12 in 773, and 48 in 774
are non-zero. The remaining elements are zero, that is, there is no guarantor and
warrantee relationship, suggesting that there are a lot of ties in the data. This
implies a lot of symmetric relationships among government officials. There are 29
asymmetric relationships between two government officials among the 325
two-person relationships in the year 772, seven in 773, and 26 in 774, suggesting
that the asymmetry in the present data is very small. Secondly, the information on
the 26 government officials is extremely limited; information such as a person's age,
where he had come from to the capital, or how long he had been working in the
government office is not known. While this would be inevitable in historical data, it
makes the obtained result difficult to interpret. Finally, the data of only 26 of the
97 government officials were dealt with in the present study, because the others
have only weak or no guarantor and warrantee relationships. Procedures other than
asymmetric multidimensional scaling might be able to disclose the guarantor and
warrantee relationships among all 97 government officials.



Analysis of Massive Emigration from Poland:
The Model-Based Clustering Approach



Ewa Witek 



Abstract The model-based approach assumes that data are generated by a finite
mixture of probability distributions such as multivariate normal distributions. In
finite mixture models, each component probability distribution corresponds to a
cluster. The problem of determining the number of clusters and choosing an
appropriate clustering method becomes a problem of statistical model choice. Hence,
the model-based approach provides a key advantage over heuristic clustering
algorithms, because it selects both the model and the number of clusters.

Model-based clustering has shown its potential in a number of practical
applications, including tissue segmentation, character recognition, minefield and
seismic fault detection, and classification of astronomical data. The article presents
an application of model-based clustering in economic analysis, which is
comparatively rare.

The moment Poland joined the EU, many citizens left the country. Since 1 May 
2004 Poland has been facing the problem of increased emigration. We used the 
model-based clustering approach for grouping and detecting inhomogeneities of 
Polish emigrants from different regions of Poland. 

Keywords Classification • Migration flows • Model-based clustering. 



1 Introduction 

To mark the fourth anniversary of the enlargement of the European Union, we
have undertaken a major study that aims to provide as definitive a picture as
possible of post-enlargement Polish migration flows to the EU. This article presents
fresh evidence on the scale and nature of migration from Poland since it joined the
EU four years ago.

Following the EU enlargement in May 2004, Ireland, Sweden, and the UK
officially opened their labour markets to employees from the new member states in
Central and Eastern Europe. The Netherlands followed suit in May 2007. However,
it would be incorrect to imply that there was no labour migration from Poland to
the Netherlands before 2007. The Netherlands was the second main destination of
choice for migrants from Silesia, formerly German territory. Due to their dual
Polish-German citizenship, the “German-Poles” have enjoyed free access to the
Dutch labour market since the early 1990s. The most popular destination countries
of Polish migration are the UK and Ireland.

According to GUS (the Central Statistical Office), since 2004 2 million Polish 
citizens have left the country (only 600,000 to the UK and 120,000 to Ireland), but 
the number of emigrants is considerably underestimated, because we do not have 
a complete insight into the volume of temporary and irregular emigration. Other 
estimates suggest that one million Poles have moved to the UK. Some 83% of them 
are under 34. 

This benign invasion of eager young Poles has undoubtedly played a very positive
role in the British economy. But has it also had a positive impact on Poland,
where 10.5% of inhabitants are unemployed and the national average wage is just
5,226 pounds per year?

As a result of migration, there are labour shortages in several sectors of the Polish
economy - in services, trade, the building industry and science. The most acute
problem of all, however, is in health care. Approximately 5,000 doctors have left
Poland over the past two years. In Lower Silesia, where Wroclaw is located, a
quarter of all anaesthetists have applied for the special certificate that allows them
to work abroad. Poland’s underfunded health service is also running out of nurses.
Another serious problem is a labour shortage in the construction industry. Officials
estimate that Poland needs to attract 200,000 workers back into the country if it is
to get stadiums and facilities ready for the 2012 European Football Championship.



2 Model-Based Clustering 
2.1 Mixture Models 



In model-based clustering, the data $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]$ are
viewed as coming from a mixture density $f(x) = \sum_{k=1}^{G} \pi_k f_k(x)$,
where $f_k$ is the probability density function of the observations in group k, and
$\pi_k$ is the probability that an observation comes from the kth mixture component
($\pi_k \in (0, 1)$ and $\sum_{k=1}^{G} \pi_k = 1$).

Each component is usually modeled by the normal (Gaussian) distribution.
Component distributions are characterized by the mean $\mu_k$ and the covariance
matrix $\Sigma_k$, and have the probability density function

$$\phi_k(x_i; \mu_k, \Sigma_k) = \frac{\exp\{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\}}{\sqrt{\det(2\pi\Sigma_k)}}. \qquad (1)$$

For univariate data, the covariance matrix reduces to a scalar variance. The
likelihood for data consisting of n observations, assuming a Gaussian mixture model
with G multivariate mixture components, is

$$\prod_{i=1}^{n} \sum_{k=1}^{G} \pi_k \phi_k(x_i; \mu_k, \Sigma_k). \qquad (2)$$
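
As a minimal illustration of the mixture density and of the mixture likelihood, the
following sketch treats the univariate case, where the covariance matrix is a scalar
variance. The function names and the parameter values are hypothetical and chosen
only for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density (scalar-variance case of the component density)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def log_likelihood(data, weights, means, variances):
    """Logarithm of the mixture likelihood: sum over observations of log f(x_i)."""
    return sum(math.log(mixture_pdf(x, weights, means, variances)) for x in data)

# Hypothetical two-component mixture.
weights = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

density_at_zero = mixture_pdf(0.0, weights, means, variances)
loglik = log_likelihood([0.0, 4.0], weights, means, variances)
print(density_at_zero, loglik)
```

Working with the logarithm of the likelihood, as here, avoids numerical underflow
when the product runs over many observations.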



Data generated by mixtures of multivariate normal densities are characterized
by groups or clusters centered at the means $\mu_k$, with increased density for
points nearer the mean. The corresponding surfaces of constant density are
ellipsoidal. Geometric features (shape, volume, orientation) of the clusters are
determined by the covariances $\Sigma_k$, which may also be parametrized to impose
constraints across components. There are a number of possible parametrizations of
$\Sigma_k$, many of which are implemented in the R package mclust (software
available at http://cran.r-project.org/web/packages/mclust/). Common instances
include $\Sigma_k = \lambda I$, where all components are spherical and of the same
size; $\Sigma_k = \Sigma$ constant across components, where all components have
the same geometry but need not be spherical; and unrestricted $\Sigma_k$, where
each component may have a different geometry.

Banfield and Raftery (1993) proposed a general framework for geometric
constraints in multivariate normal mixtures by parametrizing covariance matrices
through eigenvalue decomposition in the following form:

$$\Sigma_k = \lambda_k D_k A_k D_k^T, \qquad (3)$$

where $D_k$ is the orthogonal matrix of eigenvectors, $A_k$ is a diagonal matrix
whose elements are proportional to the eigenvalues, and $\lambda_k$ is an associated
constant of proportionality. The decomposition factors $\lambda_k$, $D_k$, and $A_k$
are treated as independent sets of parameters, and are either constrained to be the
same for each component or allowed to vary among components. When parameters
are fixed, components share certain geometric properties: $D_k$ governs the
orientation of the kth component of the mixture, $A_k$ its shape, and $\lambda_k$ its
volume, which is proportional to $\lambda_k^d \det A_k$.

The geometric properties of the models available in the R package mclust are
shown in Fig. 1.

Fig. 1 Parametrizations of the multivariate Gaussian mixture model available in the
mclust package

2.2 Parameter Estimation and Model Selection

The EM algorithm is a general approach to maximum likelihood estimation in the
presence of incomplete data, and it is one of the most popular algorithms used in
model-based clustering. For details of this algorithm, see Dempster and Laird (1977).

When the EM algorithm is used to find the maximum mixture likelihood, a
reliable approximation to twice the log Bayes factor, called the BIC (Schwarz, 1978),
is applicable:

$$\mathrm{BIC}_k = 2 \log p(x \mid \theta_k, M_k) - v_k \log(n), \qquad (4)$$

where $\log p(x \mid \theta_k, M_k)$ is the maximized mixture loglikelihood of the data
for the model $M_k$, $v_k$ is the number of independent parameters to be estimated
in the model, and n is the number of observations. In the BIC, a penalty term for
the complexity of the model is added to the loglikelihood, so that it may be
maximized for more parsimonious parametrizations and smaller numbers of groups
than the loglikelihood. Accordingly, the larger the value of the BIC, the stronger the
evidence for the model. The BIC can be used to compare models with differing
parametrizations, differing numbers of components, or both. Bayesian criteria other
than the BIC have been used in cluster analysis. The performance of some of these
criteria was compared by Biernacki, Celeux, and Govaert (1999). Model choice
based on the BIC has given good results in a range of applications of model-based
clustering (e.g., Fraley & Raftery, 2002; Stanford & Raftery, 2000).
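
The trade-off between fit and complexity expressed by the BIC can be illustrated
with a short sketch. The loglikelihood values, parameter counts, and model labels
below are hypothetical; only the formula follows the definition given above, under
which a larger BIC indicates stronger evidence for a model.

```python
import math

def bic(loglik, n_params, n_obs):
    """BIC = 2 * maximized loglikelihood - number of parameters * log(n)."""
    return 2.0 * loglik - n_params * math.log(n_obs)

# Hypothetical maximized loglikelihoods of two candidate models for n = 100 observations.
n_obs = 100
candidates = {
    "simple (5 parameters)": bic(loglik=-250.0, n_params=5, n_obs=n_obs),
    "complex (14 parameters)": bic(loglik=-240.0, n_params=14, n_obs=n_obs),
}
best = max(candidates, key=candidates.get)
print(candidates, best)
```

Here the more complex model fits slightly better, but its gain in loglikelihood does
not outweigh the penalty for nine additional parameters, so the simpler model has
the larger BIC.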



2.3 Model-Based Strategy for Clustering 

Fraley and Raftery (1998) obtained good results in a number of examples by using 
the partitions produced by model-based hierarchical agglomeration as starting val- 
ues for an EM algorithm for unconstrained Gaussian models, together with the BIC 
to determine the number of clusters. Their approach forms the basis for a general 
model-based strategy for clustering: 

1. Determine a maximum number of clusters, M, and a set of mixture models to
consider.

2. Perform hierarchical agglomeration to approximately maximize the classification
likelihood for each model, and obtain the corresponding classifications for up to
M groups.

3. Apply the EM algorithm for each model and each number of clusters 2, ..., M,
starting with the classifications from hierarchical agglomeration.

4. Compute the BIC for the one-cluster case for each model and for the mixture
model with the optimal parameters from EM for 2, ..., M clusters.

5. Plot the BIC values for each model. A decisive first local maximum indicates
strong evidence for a model (parametrization and number of clusters).

For details of model-based clustering, see McLachlan and Peel (2000) and
Fraley and Raftery (2002).
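
The five steps above can be sketched for univariate data as follows. This is a
simplified illustration under stated assumptions, not the mclust implementation: a
single model with unequal variances is considered, the hierarchical agglomeration of
step 2 is replaced by a split of the sorted data into contiguous blocks, M is set
to 2, and the BIC values are compared directly instead of being plotted. All
function names and the data are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def initial_partition(data, g):
    """Stand-in for step 2: split the sorted data into g contiguous blocks."""
    ordered = sorted(data)
    n = len(ordered)
    return [ordered[i * n // g:(i + 1) * n // g] for i in range(g)]

def fit_mixture(data, g, n_iter=50, var_floor=1e-6):
    """Step 3: EM for a univariate Gaussian mixture, started from the partition."""
    n = len(data)
    blocks = initial_partition(data, g)
    weights = [len(b) / n for b in blocks]
    means = [sum(b) / len(b) for b in blocks]
    variances = [max(sum((x - m) ** 2 for x in b) / len(b), var_floor)
                 for b, m in zip(blocks, means)]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: update weights, means, and variances from the posteriors.
        for k in range(g):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, var_floor)
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )
    return loglik, means

def bic(loglik, g, n):
    """Step 4: g means, g variances, and g - 1 free weights give 3g - 1 parameters."""
    return 2.0 * loglik - (3 * g - 1) * math.log(n)

# Hypothetical data with two well-separated groups.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
max_clusters = 2  # step 1: M = 2 to keep the example small
fits = {g: fit_mixture(data, g) for g in range(1, max_clusters + 1)}
bic_values = {g: bic(fits[g][0], g, len(data)) for g in fits}
best_g = max(bic_values, key=bic_values.get)  # step 5, without the plot
print(bic_values, best_g)
```

The variance floor is a simple guard against degenerate components; a fuller
treatment would also handle the case in which all component densities underflow
for some observation.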



3 Example 

The analysis of Polish migration flows uses the estimated ratio of emigrants
to the population in different regions of Poland: the number of registered departures
from the country for a permanent stay abroad divided by the number of residents in
a given subregion. The two other variables considered are the unemployment rate
(the number of the registered unemployed at the end of the year) and the average
wage (average monthly gross wages and salaries) in the different subregions of
Poland. The analysis was restricted to the year 2006, the year of the highest
emigration ratio (the 2007 data have not yet been made available by the Polish
Statistical Office). The basic statistical measurements of the data are presented in
Table 1. We also investigated the distribution of each variable using the
Shapiro-Wilk and Kolmogorov-Smirnov normality tests. We observed a certain
departure from normality for the emigration ratio variable. Although the assumption
of normality was not fulfilled for every variable, we decided to use model-based
clustering, as the distribution of two out of the three variables was normal.

Most clustering done in practice is based largely on heuristic but intuitively
reasonable procedures. One widely used class of methods involves hierarchical
agglomerative clustering. Another common class of methods is based on iterative
relocation, in which data points are moved from one group to another until there
is no further improvement in some criterion. Iterative relocation with the sum of
squares criterion is called k-means clustering (MacQueen, 1967). Although there
has been considerable research in this area (e.g., dendrogram analysis for
hierarchical clustering), there is little systematic guidance associated with these
methods for solving basic practical questions that arise in cluster analysis, such as
how many clusters there are, which clustering method should be used, and how
outliers should be handled. Moreover, the statistical properties of these methods are
generally unknown, precluding the possibility of formal inference. It has also been
shown that some of the most popular heuristic clustering methods are approximate
estimation methods for certain probability models. For example, standard k-means
clustering and Ward’s method are equivalent to known procedures for approximately
maximizing the multivariate normal classification likelihood when the covariance
matrix is the same for each component and proportional to the identity matrix.
Neither hierarchical nor relocation methods directly address the issue of determining
the number of groups within the data. Various strategies for simultaneous
determination of the number of clusters and cluster membership have been proposed
(Bock, 1998; Bozdogan, 1993; Engelman & Hartigan, 1969). Model-based clustering
is an alternative, which is described and applied to the migration data in this paper.

Table 1 The basic statistical measurements of the data

                     Emigration ratio   Unemployment rate   Wage
Min                  0.0007             4.6                 1,979
Max                  0.733              28.9                3,790
Mean                 0.836              16.16               2,364
Standard deviation   0.180              5.529               344.162

The model-based clustering methodology outlined in Sect. 2.3 yields the results
shown in Fig. 2. The maximum BIC value occurs for the two-group VVI model (the
model in which the volumes of the clusters differ, the shapes of the clusters vary,
and the clusters are oriented along the coordinate axes). The difference in BIC
values between the two- and four-group models is small enough to conclude that
there are either two or four groups in the data. Because in clustering the group
memberships are not known in advance, we cannot assess the error rate in each
case. We therefore assessed the cluster structure of the VVI,2 and VVI,4 models
using Rousseeuw’s Silhouette cluster quality index (Kaufman & Rousseeuw, 1990).
The clustering with two clusters yields the higher value of the Silhouette index,
equal to 0.61.
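
For readers unfamiliar with the Silhouette index, the following sketch computes
silhouette widths from their general definition, s(i) = (b - a) / max(a, b), where a
is the mean distance from observation i to the other members of its own cluster and
b is the smallest mean distance to the members of any other cluster. It is not the
computation carried out for the migration data: the points and labels are
hypothetical, the data are one-dimensional, and at least two clusters with distinct
points are assumed.

```python
def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for one-dimensional points."""
    widths = []
    for i, (x, own) in enumerate(zip(points, labels)):
        # Mean distance from point i to the members of each cluster (excluding itself).
        mean_dist = {}
        for label in set(labels):
            others = [abs(x - y) for j, (y, lab) in enumerate(zip(points, labels))
                      if lab == label and j != i]
            if others:
                mean_dist[label] = sum(others) / len(others)
        if own not in mean_dist:
            widths.append(0.0)  # singleton cluster: s(i) = 0 by convention
            continue
        a = mean_dist[own]
        b = min(d for label, d in mean_dist.items() if label != own)
        widths.append((b - a) / max(a, b))
    return widths

# Hypothetical points forming two compact, well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
average_width = sum(widths) / len(widths)
print(widths, average_width)
```

An average width close to 1 indicates compact and well-separated clusters, whereas
values near 0 indicate overlapping clusters.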

Of note is that in Fig. 2 no values of the BIC are given for model VII with seven
clusters and for models VVI and VEV with seven clusters each. In these cases, the
covariance matrix associated with one or more of the mixture components is
ill-conditioned, so the loglikelihood and hence the BIC cannot be computed. Because
the data are three-dimensional, a minimum of four points is required for the
estimate of the covariance matrix to be non-singular.

Fig. 2 The Bayesian information criterion (BIC) for model-based clustering applied
to the migration data (horizontal axis: number of components)

Fig. 3 Left: classification; right: uncertainty (horizontal axis: emigration_ratio)

The classification assignment for two classes is shown in Fig. 3. The component
means are marked, and ellipses with axes are drawn corresponding to their
covariances.

The right side of Fig. 3 is a projection of the Polish emigration data showing
classification uncertainty (Fraley & Raftery, 1998, p. 580; Fraley & Raftery,
2002, p. 614). Larger symbols indicate the more uncertain observations (subregions:
Kluczbork, Bielsko-Biala, Lodz city).
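
Classification uncertainty of this kind is commonly measured as one minus the
largest posterior membership probability of an observation. The sketch below
assumes that definition; the posterior probabilities are hypothetical and do not come
from the fitted model.

```python
# Hypothetical posterior membership probabilities: three observations, two clusters.
posteriors = [
    [0.99, 0.01],
    [0.60, 0.40],
    [0.05, 0.95],
]

# Each observation is assigned to its most probable cluster ...
classification = [max(range(len(z)), key=z.__getitem__) for z in posteriors]
# ... and its uncertainty is one minus that largest probability.
uncertainty = [1.0 - max(z) for z in posteriors]
most_uncertain = max(range(len(uncertainty)), key=uncertainty.__getitem__)
print(classification, uncertainty, most_uncertain)
```

The second observation, whose membership probabilities are closest to each other,
is the most uncertain and would be drawn with the largest symbol.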

Since we do not know the group memberships in advance, we cannot check the
quality of the clustering as measured by the error rate and compare it with standard
k-means clustering and Ward’s method. A difficulty with some of the more heuristic
clustering algorithms is the lack of a statistically principled method for determining
the number of clusters. Model-based clustering is an inferentially based procedure
that uses model selection methods to make this decision.



4 Conclusions 

The paper explores an official source of data that allows us to characterize Polish
emigrants and their migration behavior. The data give a more general insight into
the groups in which emigration becomes significant in terms of the number of
emigrants compared to the population in different parts of Poland. We might
assume that unemployed workers and people with lower wages are the most likely
to leave the country. However, the model-based clustering analysis yields two
groups of Polish emigrants. One group, which has a higher emigration rate, has
much higher wages and lower unemployment rates than the other one.

Different reports, e.g., Budnik (2007), show that low wages and the high
unemployment rate were the main reasons for emigration, but only just after the EU
accession (2004). We think that people in the first group were driven not only by a
desire to earn money, but also by a will to travel, gain new experiences, and improve
their language qualifications. Hence, the first class is formed mainly by emigrants
from big cities, where people are bolder and better educated (with access to
university education). Those are people open to Europe. The biggest subregions of
the second class are illustrated in Fig. 4. In contrast to this class, the other class is
made up of emigrants from the less industrial and less developed regions of Poland,
whose main emigration goal was to earn money. Our analysis and sociological
research also suggest that changes in the immigration policies of most countries,
together with different incentives and benefits, might have an impact on emigrants.
In particular, the open door policy might relatively strongly encourage the
emigration of workers previously employed in the home country.

Fig. 4 The biggest subregions of the second class

The spatial analysis of migrations in different areas makes it possible to determine
the structure of population movement by directions and stream volumes, as well as
migration balances in urban and rural areas. Analyses of regional and urban
patterns can be found in, e.g., Kupiszewski (2006) and Potrykowska (2007). The
analysis provided by the literature confirms that major urban agglomerations have
areas that are permanent sources of migrants and a more or less constant migration
structure. Migrations to urban agglomerations are of a regional nature and clearly
depend on distance.

One crucial question related to migration is whether it is a temporary or a
permanent phenomenon. What complicates the answer is the fact that social
researchers cannot decide which argument is right. Asking emigrants whether they
intend to stay abroad is not a very useful method, since most of them do not know
what the future will bring. We can observe that many Poles come back. We think
that this is the group who just aimed to earn money. The starkest reason to think so
might be that the Polish zloty rose against the pound by 20% in 2007. In addition,
wages in Poland are rising fast. Overall, UK wages compared to Polish wages
measured in zloty have fallen by a quarter. Unfortunately, most emigrants,
especially those with higher education, stay abroad.

We hope that those young people will miss the country and come back with 
excellent language skills and valuable job experience. 



5 Discussion 

We have discussed the clustering methodology based on multivariate normal mixture
models and shown its application in the clustering of Polish emigrants. This
approach, however, has some limitations. Firstly, computational methods for hierarchical
clustering have storage and time requirements that grow at a rate faster than
linear in the size of the initial partition, so they cannot be directly applied
to large datasets. The practical use without modification can also be limited in the
case of high-dimensional datasets. Secondly, although experience to date suggests that
models based on the multivariate normal distribution are sufficiently flexible to
accommodate many practical situations, the underlying assumption is that groups are
concentrated locally in linear subspaces, so mixtures of other distributions (e.g.,
t-distributions) or other clustering methods may be more suitable in some cases.

We would also like to note that we have analyzed the number of registered departures
from the country for permanent residence abroad. The analysis does not
encompass the seasonal emigration of Poles, which, due to its duration, is subject
to simplified official and legal procedures or is not registered at all in most EU
countries.



Systematics of Short-Range Correlations
in Eukaryotic Genomes 



Jörn Hameister, Werner E. Helm, Marc-Thorsten Hütt, and Manuel Dehnert



Abstract Attempts to identify a species on the basis of its DNA sequence on purely 
statistical grounds have been formulated for more than a decade. Solving this prob- 
lem could have a huge impact on understanding processes of genome evolution and 
on the design of classification schemes for DNA sequences. 

We have shown previously that patterns in the short-range statistical correlations 
in DNA sequences serve as evolutionary fingerprints of Eukaryotic genomes. All 
chromosomes of a species display the same characteristic pattern, markedly differ- 
ent from those of other species. The chromosomes of a species are sorted onto the 
same branch of a phylogenetic tree due to this correlation pattern. 

Here we summarize these results from our group, highlight some of the algorith- 
mic challenges involved and discuss the overall picture on the relation between sta- 
tistical sequence properties and biological processes of genome evolution emerging 
from these and other findings. 

Keywords Automated classification • DNA sequences • Genome • Phylogeny.



1 Introduction 



An automated classification of DNA sequences based on their statistical properties
is one of the current challenges of metagenomics projects (Pride, Meinersmann,
Wassenaar, & Blaser, 2003; Teeling, Meyerdierks, Bauer, Amann, & Glöckner,
2004; McHardy, Martin, Tsirigos, Hugenholtz, & Rigoutsos, 2007). Beyond word
counts (Karlin & Mrazek, 1997; Schbath, 1997; Qi, Wang, & Hao, 2004) and
visualizations based on several scales (Goldman, 1993), statistical correlations can
contribute to this task, as they have proven to be quite informative, both for
distinguishing species and for distinguishing functional categories of DNA sequences.

While this classification task is particularly important for micro-organisms
(which are the dominant component in metagenomics data), we would like to show
here that for Eukaryotes, too, the classification schemes derived from model-based
short-range correlations are extremely informative, both for better understanding
processes of genome evolution and for comparing the systematics with phylogenetic
studies.

Appropriate representations of phylogenetic relationships are often cited as one 
of the major mathematical challenges in biology (Cohen, 2004). To a certain extent, 
statistical properties of DNA sequences may serve as a more objective approach to 
analyzing species relationships from a genome-wide perspective (Karlin & Mrazek, 
1997; Gentles & Karlin, 2001; Rokas, Williams, King, & Carroll, 2003). For exam- 
ple, Karlin and Mrazek have analyzed the systematic differences in dinucleotide 
frequencies within and between species and obtained a biologically plausible phy- 
logenetic tree for mitochondrial and nuclear genomes. Extracting phylogenetic 
properties from genome-wide statistical observables has also been attempted for 
prokaryotes by Qi et al. (2004), who analyzed asymmetries in the distribution of
“words” of length n (the so-called n-word distribution). An alternative method for
efficiently condensing (and subsequently processing) genome-wide information is 
given by the average correlation between two nucleotides with a distance k in the 
DNA sequence. A large amount of research has focused on long-range correla- 
tions in DNA up to distances of thousands of basepairs (Li & Kaneko, 1992; Peng, 
Buldyrev, Goldberger, Havlin, Sciortino, et al., 1992; Holste, Grosse, Beirer, Schieg, 
& Herzel, 2003). 

Short-range correlations, which are our focus of interest in the present article,
have received far less attention. The two most important contributions to correlations
at distances up to a few tens of basepairs are period-three oscillations of the
correlation strength, which characterize coding regions, and oscillations with a
period around 10 and 11 basepairs, which correspond to DNA bending properties
and are a reflection of the double-helical structure (Trifonov & Sussman, 1980;
Trifonov, 1998). Information theory has turned out to be a particularly convenient
framework for investigating such properties. The period-3 oscillations lead to pro- 
nounced differences of, e.g., the mutual information function at small distances for 
coding and non-coding DNA segments, respectively. It was observed that these 
differences are species-independent (Grosse, Herzel, Buldyrev, & Stanley, 2000). 
Furthermore, two peaks of the mutual information function around a distance of a 
few hundred base pairs have been related to the internal structure of Alu repeats 
(Holste et al., 2003). 

In several recent studies (Dehnert, Plaumann, Helm, & Hütt, 2005; Dehnert,
Helm, & Hütt, 2005, 2006) we analyzed the average correlation of a symbol within
a DNA sequence with another symbol at a distance k up to distances of a few tens
of nucleotides. Our model-based approach proved to be superior to just using the
(average) mutual information (Dehnert, Helm, et al., 2005). We find that these
correlation profiles, when analyzed for a variety of eukaryotic species, display a high
degree of intra-species similarity and systematic inter-species differences. Intriguingly,
these inter-species differences seem to increase with evolutionary distance,
i.e., a cluster tree based upon distances of the correlation profiles sorts all
chromosomes involved into fully separated species clusters within the tree and approximates
the corresponding phylogenetic tree of these species (Dehnert, Plaumann, et al.,
2005).

These correlation signatures are not only interesting due to their classification
potential, but also because they provide insight into genome evolution. On an
evolutionary time scale, the correlation pattern of Eukaryotic genomes is thus shaped by
the production and (mutational) degradation of repeats. The different time scales, 
as well as the different production and degradation rates, attributed to the differ- 
ent classes of repetitive DNA essentially determine the statistical properties of the 
sequence. Deleting classes of repetitive DNA from the sequence thus can, for the 
study of correlation properties, serve as an estimate of ancestral genomes. 

The main purpose of this article is to present the systematics of short-range cor- 
relations in DNA sequences and their potential for classifying sequences. Special 
emphasis is put here on describing the algorithmic tools we use to obtain these 
results, as we hope this description will facilitate the use of these methods also in 
other contexts. 



2 Systematics of Correlation Signatures 

Recently we introduced a technique for quantifying statistical correlations in DNA
sequences based upon a discrete autoregressive process (see Dehnert, Plaumann,
et al., 2005; Dehnert, Helm, et al., 2005 for details).

In this type of analysis a DNA sequence (e.g., of a whole chromosome) is represented
by its correlation curve, i.e., by the vector of correlation strengths $a_i$ at the
distances i. Differences between such correlation curves, when properly normalized,
can serve as inputs of a clustering analysis.

In the following the correlation strength of two nucleotides at a distance k is
represented by the parameter vector a of a DAR(p) process. Figure 1 shows the
correlation curves for distances k = 1, ..., 30 calculated for all chromosomes of
six species. This figure already reveals the essence of our finding: all chromosomes
of a single species follow essentially the same correlation curve. In fact, for the
examples given in Fig. 1 even the degree of interspecies similarity shows a certain
pattern. Systematic differences between two species’ correlation curves seem
to increase with the species’ evolutionary distance. Later we will see the limit of
this simple observation, but for the cases shown in Fig. 1 this phylogenetic aspect is
rather striking.

Fig. 1 Correlation curves for chromosomes of the following species: (a) H. sapiens [22 curves],
(b) P. troglodytes [23 curves], (c) M. musculus [19 curves], (d) R. norvegicus [20 curves], (e) D.
melanogaster [5 curves] and (f) A. gambiae [4 curves]. Sex chromosomes have been omitted. In
all cases the correlation curve is given by the parameter vector a of a DAR(30) process. (Adapted
from Dehnert, Helm, et al., 2005)

In order to study this visual impression on a more quantitative level we compute
pairwise distances of the correlation curves (using a standard L1 norm of the
correlation vectors; see Dehnert, Helm, et al., 2005 for details). This yields a distance
matrix on the correlation curves to which a clustering algorithm can be applied.
The resulting clustering tree represents a very efficient method of condensing the
information contained in the correlation curves into a single relational structure.
In our previous work we found that essentially all chromosomes are automatically
sorted into the appropriate species cluster (Dehnert, Plaumann, et al., 2005). This
remarkable feature of chromosome clustering corresponds to the high degree of
intra-species synchrony of the correlation curves observed in Fig. 1. The clustering
tree for 125 chromosomes of eight different species based upon the distance matrix
obtained from the correlation curves is shown in Fig. 2.
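
The distance computation described above can be sketched in a few lines of Python.
This is a minimal illustration and not the original implementation: the toy vectors
and the labels HU01, HU02 and MU01 are hypothetical, and the normalization of the
correlation curves mentioned earlier is omitted.

```python
from typing import Dict, List


def l1_distance(a: List[float], b: List[float]) -> float:
    """Sum of absolute componentwise differences of two correlation curves."""
    if len(a) != len(b):
        raise ValueError("correlation curves must have the same length p")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(curves: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Symmetric matrix of pairwise L1 distances, keyed by taxon label."""
    labels = sorted(curves)
    return {s: {t: l1_distance(curves[s], curves[t]) for t in labels} for s in labels}


# Hypothetical toy curves (p = 3); the values are not taken from any real genome.
toy_curves = {
    "HU01": [0.10, 0.05, 0.02],
    "HU02": [0.11, 0.05, 0.01],
    "MU01": [0.20, 0.12, 0.08],
}
D = distance_matrix(toy_curves)
```

The resulting matrix D can be handed to any standard hierarchical clustering routine;
in this toy example the two HU curves are much closer to each other than either is to
the MU curve.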

Fig. 2 Clustering tree for 125 chromosomes of eight eukaryotic species. The figure shows the
consensus tree of 100 bootstrap replicates. Numbers above branches indicate bootstrap values.
By attributing a color to each species, each cluster can formally be represented by a color line
fragment (shown on the right-hand side of each cluster). These color line fragments are sorted via
the TCC algorithm (see Sect. 3.1) and lead to the color line displayed on the right-hand side of the
figure. The following species are included: A. gambiae (MO), C. elegans (CE), D. melanogaster
(DR), G. gallus (GA), H. sapiens (HU), M. musculus (MU), P. troglodytes (CH) and R. norvegicus
(RA). For details on the genomic data see Dehnert, Helm, et al. (2005). The number after the
two-letter abbreviation for the species indicates the number of the respective chromosome. (Adapted
from Dehnert, Helm, et al.)

Note that no information other than the distance of correlation curves enters the
clustering process. In particular, all chromosomes are formally treated as individual
taxa. Both the clustering of chromosomes pertaining to the same species and those
aspects of the tree clearly in correspondence with evolutionary species differentiation
emerge from the correlation curves alone. One observes that in almost all cases
the chromosomes of a single species form a cluster of their own. The most obvious
exception is the complete mixture of human and chimpanzee chromosomes. Clearly,
the capability of the correlation curves to distinguish between species breaks down at
such small evolutionary distances.

In Fig. 2 we also find an example where the clustering tree based on correlation
curve distances clearly deviates from a pure phylogenetic tree (a term strictly
applicable only on the species level, not on the level of single chromosomes): the
position of chicken, in spite of the high degree of clustering of its chromosomes,
is misplaced from a phylogenetic point of view. These deviations from pure
phylogeny may help reveal the correspondence between genome evolution and
statistical patterns in sequences.

An important question is how the chromosome clustering depends on the amount
of underlying sequence information. In order to study the dependence of chromosome
sorting on sequence length we developed a tool for monitoring the change
of a clustering tree as a function of some parameter (see Sect. 3.1 for a detailed
description of the algorithm). The idea of this tree color coding (TCC) plot is to
apply topologically allowed branch switches to bring a large set of trees (each tree
belonging to a certain value of the parameter) as close to the same predefined (e.g.,
alphanumerical) order as possible and then color-code the sequence of taxa (by
assigning a single color to all chromosomes of one species), resulting for each tree in
a single line, which represents the chromosome sorting. In Fig. 2 the corresponding
color line is shown next to the clustering tree. On the level of the TCC plot the focus
of attention is on chromosome clustering instead of on the detailed progression of
the branching. It is, therefore, appropriate to include only species with more than,
e.g., four chromosomes in the analysis, as then the order of some color segment
can be meaningfully evaluated.

In Fig. 3 it is seen that quite a small amount of data (approximately a few tens of
thousands of bases) is sufficient to establish already rather ordered clustering patterns.
It is, however, seen that with increasing sequence length stable horizontal stripes
emerge, which are a clear indication of stable pairs of chromosomes being
formed. Occasional jumps of whole blocks in the TCC plot are a side effect of
the sorting algorithm, where a single outsider chromosome within a homogeneous
cluster can define the alphanumeric label of this cluster and, consequently, induce
a jump of this cluster when the outsider chromosome, e.g., disappears from the
cluster as sequence length is increased. For very short sequences (approx. 30 kbp
and lower) the estimation process of the correlation vector fails. This lack of
convergence is clearly seen in the TCC plot, where the overall order of the color line
is lost spontaneously. This length scale can be viewed as the statistical limitation
of our method: shorter sequences cannot be sorted on the basis of their correlation
vector.

Note that the chromosome clustering in general (and particularly the distinction
of mouse and rat) also depends on the range of nucleotide distances. For example,
the sequence length for which full distinction is achieved decreases substantially
(i.e., the distinctive power of our method increases) when p is increased (Dehnert
et al., 2006).

Fig. 3 Tree color coding (TCC) plot for 14 Eukaryotic species for the DAR(p)-based correlation
vectors. The length of the underlying DNA sequences is varied. For each length a clustering tree
is computed and then translated into a color line with the TCC algorithm (cf. Fig. 2). Starting
with the first 1 kbp of each of the 203 chromosomes of the 14 species, the sequence lengths are
simultaneously increased with step sizes of 1 kbp (up to 200 kbp) and 10 kbp (up to 25 Mbp). If
the length of a chromosome is exceeded before reaching 25 Mbp, the length is kept constant at the
maximum possible length. (Adapted from Dehnert, Helm, et al., 2005)




3 Algorithmic Challenges 

3.1 Systematic Comparison of Many Trees: The Tree-Color 
Coding Method 

When studying the parameter dependence of our result we are confronted with the 
task of comparing a substantial number of different clustering trees. 

Fig. 4 Schematic view of the tree color coding (TCC) algorithm. (a) Operations upon the
dendrogram are shown for a simple clustering tree of five taxa from three different species. Starting
from the unsorted tree (1) the TCC algorithm (see Sect. 3.1) yields a sorted tree (4) by iterative
application of branch switches. (b) Visualisation of the original tree (1), intermediate steps (2) and
(3) and the final, sorted tree (4) as TCC lines. (Adapted from Dehnert et al., 2006)

For our purposes the key observable on such a clustering tree is the quality of
species distinction, i.e., how pronounced the formation of clusters appears in the
tree. Comparing such trees requires a universal sorting of the branches. To this
end we developed a sorting algorithm which translates such a clustering tree into
a simple line of colors, where the number of color changes basically reflects the
amount of clustering in the underlying tree and therefore the quality of species
distinction. The algorithm acts upon the Newick representation of a clustering tree,
where entries in a list represent taxa and matching brackets denote objects linked by
branches. Our algorithm, which is illustrated in Fig. 4, first acts upon the innermost
branches and performs an alphanumerical sorting of the corresponding taxa using
only topologically allowed branch switches. In the next step one moves to the next
higher order of branches and applies sorting there. Whenever one encounters
non-elementary objects at the end of one branch (i.e., a subtree instead of a single taxon)
the alphanumerically lowest object in the subtree serves as a label for the subtree
itself. After passing through all hierarchical levels in the clustering tree all taxa are
sorted as close to alphanumerical order as the topology of the tree allows. Coloring all
taxa according to their species affiliation leads to a color line whose homogeneity
directly reflects the degree of clustering observed in the original tree and, furthermore,
can be immediately compared with any other tree consisting of the same taxa
due to the universal order of taxa approximated by the sorting algorithm. The tree
color coding algorithm slightly overestimates the overall order in the tree, as different
branches containing taxa of the same species can become direct neighbours in
the color line, even if one of them also contains chromosomes of another species.
It is, however, also clear from Fig. 4 that this systematic error is rather small when
many more taxa than species (colors) are involved. Figure 4 illustrates our tree color
coding algorithm with a very simple tree consisting of five taxa from three different
species.
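
The sorting procedure described above can be sketched as follows. This is a minimal
Python illustration under stated assumptions, not the original implementation: the
tree is assumed to be already parsed from its Newick string into nested tuples, the
species is assumed to be encoded in the first two letters of each label, plain string
comparison stands in for the alphanumerical order, and the five-taxon tree is
hypothetical (it is not the tree of Fig. 4).

```python
from typing import List, Tuple, Union

# A tree is either a taxon label or a tuple of subtrees (parsed Newick brackets).
Tree = Union[str, Tuple["Tree", ...]]


def lowest_label(tree: Tree) -> str:
    """Alphanumerically lowest taxon label contained in a (sub)tree."""
    if isinstance(tree, str):
        return tree
    return min(lowest_label(child) for child in tree)


def tcc_sort(tree: Tree) -> Tree:
    """Sort the branches at every node, innermost branches first.

    Reordering the children of a node leaves the topology unchanged, so only
    topologically allowed branch switches are applied.
    """
    if isinstance(tree, str):
        return tree
    children = [tcc_sort(child) for child in tree]
    return tuple(sorted(children, key=lowest_label))


def leaves(tree: Tree) -> List[str]:
    """Taxon labels in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def color_line(tree: Tree) -> List[str]:
    """Species code (assumed: first two letters of the label) for each taxon."""
    return [label[:2] for label in leaves(tree)]


def color_changes(line: List[str]) -> int:
    """Number of positions where neighbouring taxa belong to different species."""
    return sum(1 for a, b in zip(line, line[1:]) if a != b)


# Hypothetical tree with five taxa from three species.
unsorted_tree = (("MU2", ("HU2", "HU1")), ("RA1", "MU1"))
sorted_tree = tcc_sort(unsorted_tree)
```

For this toy tree the taxa end up in the order HU1, HU2, MU2, MU1, RA1. The two MU
taxa become neighbours in the color line although they sit in different branches,
which is an instance of the slight overestimation of order mentioned above.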



3.2 Memory and Run Time Management for Large Genomes 

To quantify the correlation between two symbols at a distance k, a discrete
autoregressive process of order p, DAR(p) (see Jacobs & Lewis, 1983), is fitted to a given
sequence. We use the estimated parameter vector a of the DAR(p) process as a
measure of the correlation strength at a distance k. This estimation process basically
consists of two steps. First, an empirical autocorrelation function (ad hoc estimator)
in symbolic space is estimated for a given DNA sequence. The second step leads
from these quantities $\hat{r}(k)$ to the actual parameters of the DAR(p) process by
solving a set of Yule-Walker equations.
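
The second step can be illustrated by a small Python sketch. It assumes that the
Yule-Walker equations take the standard autoregressive form
r(k) = a_1 r(|k - 1|) + ... + a_p r(|k - p|) for k = 1, ..., p with r(0) = 1; the exact
system for the DAR(p) model is given in Jacobs & Lewis (1983) and may differ in
detail. The function names are hypothetical.

```python
from typing import List


def solve_linear_system(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting (no external libraries)."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("singular Yule-Walker system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def yule_walker(r: List[float]) -> List[float]:
    """Solve r(k) = sum_i a_i r(|k - i|), k = 1..p, assuming r(0) = 1.

    The argument holds the estimated autocorrelations r(1), ..., r(p).
    """
    p = len(r)
    full = [1.0] + list(r)
    matrix = [[full[abs(k - i)] for i in range(1, p + 1)] for k in range(1, p + 1)]
    return solve_linear_system(matrix, list(r))


# Autocorrelations generated by the parameters a = (0.5, 0.2) under the assumed form.
a_hat = yule_walker([0.625, 0.5125])
```

The system matrix is a symmetric Toeplitz matrix, so for large p a dedicated solver
such as the Levinson-Durbin recursion would be the more economical choice; plain
elimination is used here only to keep the sketch short.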

The ad hoc estimator for the correlation strength is defined as follows (Jacobs &
Lewis, 1983):

$$\hat{r}(k) = 1 - \sum_{a_i \in A} B_m(k, a_i)\,\frac{1}{1 - \pi(a_i)} \tag{1}$$

for k = 1, 2, ... with

$$B_m(k, a_i) = \frac{1}{m-k} \sum_{\substack{a_j \in A \\ a_j \neq a_i}} \sum_{l=1}^{m-k} \delta_{a_j}(X_l)\,\delta_{a_i}(X_{l+k}), \tag{2}$$

where $\delta_y(x) = 1$ if $x = y$, and 0 otherwise.
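
A direct, unoptimized Python transcription of formulas (1) and (2) may help to fix the
notation. It is a sketch under assumptions that are not stated in the text: m is taken
to be the sequence length, π(a_i) is estimated by the relative frequency of the symbol
a_i in the sequence, and formula (1) is read as
r(k) = 1 - Σ_i B_m(k, a_i) / (1 - π(a_i)). The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def b_m(seq: Sequence[str], k: int, a_i: str, alphabet: Sequence[str]) -> float:
    """Formula (2) with the explicit double sum over a_j != a_i and over l."""
    m = len(seq)
    total = 0
    for a_j in alphabet:
        if a_j == a_i:
            continue
        for l in range(m - k):
            # delta_{a_j}(X_l) * delta_{a_i}(X_{l+k})
            total += int(seq[l] == a_j) * int(seq[l + k] == a_i)
    return total / (m - k)


def r_hat(seq: Sequence[str], k: int, alphabet: Sequence[str] = "ACGT") -> float:
    """Ad hoc estimator of the correlation strength at distance k (assumed form)."""
    m = len(seq)
    if not 1 <= k < m:
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    counts = Counter(seq)
    value = 1.0
    for a_i in alphabet:
        pi = counts[a_i] / m  # assumption: pi(a_i) is the relative frequency
        if pi == 0.0:
            continue  # an absent symbol contributes B_m = 0
        if pi == 1.0:
            raise ValueError("estimator undefined for a constant sequence")
        value -= b_m(seq, k, a_i, alphabet) / (1.0 - pi)
    return value


# Hypothetical test sequence with exact period four.
toy_sequence = "ACGT" * 25
```

For the periodic toy sequence the estimate at distance 4 equals 1, because every symbol
is followed by an identical symbol four positions later, while the estimate at
distance 1 is negative, because neighbouring symbols never coincide.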

The size of eukaryotic genomes makes it necessary that the computational processing
of the DNA data is time and memory efficient. By applying different
strategies one can improve the performance of the implementation. Here we outline
our approach for a run time efficient implementation.

The implementation of formula (2) can be optimized, e.g., by reducing the
number of required loops. For an alphabet $A = \{a_1, a_2, \ldots, a_n\}$ and $y \in A$
it holds that

$$\sum_{y' \in A \setminus \{y\}} \delta_{y'}(x) = \delta_{!y}(x) \quad \forall x \in A, \tag{3}$$

where $\delta_{!y}(x) = 0$ if $x = y$, and 1 otherwise.

In this way one can replace a sum in the implementation by a negation and
formula (2) can be written as

$$B_m(k, a_i) = \frac{1}{m-k} \sum_{l=1}^{m-k} \delta_{!a_i}(X_l)\,\delta_{a_i}(X_{l+k}). \tag{4}$$
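
The identity (3) and the agreement of formulas (2) and (4) can be checked numerically.
In the following Python sketch the function names and the test sequence are
hypothetical, m is assumed to be the sequence length, and the negated indicator is
implemented as a plain inequality test.

```python
def delta(y: str, x: str) -> int:
    """Indicator: 1 if x equals y, 0 otherwise."""
    return 1 if x == y else 0


def delta_not(y: str, x: str) -> int:
    """Negated indicator: 0 if x equals y, 1 otherwise."""
    return 0 if x == y else 1


def b_m_formula_2(seq: str, k: int, a_i: str, alphabet: str) -> float:
    """Double sum over all a_j different from a_i and over all positions l."""
    m = len(seq)
    total = sum(
        delta(a_j, seq[l]) * delta(a_i, seq[l + k])
        for a_j in alphabet
        if a_j != a_i
        for l in range(m - k)
    )
    return total / (m - k)


def b_m_formula_4(seq: str, k: int, a_i: str) -> float:
    """Single sum over l; the sum over a_j is replaced by a negation."""
    m = len(seq)
    total = sum(delta_not(a_i, seq[l]) * delta(a_i, seq[l + k]) for l in range(m - k))
    return total / (m - k)


ALPHABET = "ACGT"
toy_sequence = "ACGTTGCAACCGGTTA"  # hypothetical test sequence

identity_3_holds = all(
    sum(delta(z, x) for z in ALPHABET if z != y) == delta_not(y, x)
    for x in ALPHABET
    for y in ALPHABET
)
formulas_agree = all(
    b_m_formula_2(toy_sequence, k, a, ALPHABET) == b_m_formula_4(toy_sequence, k, a)
    for k in range(1, 6)
    for a in ALPHABET
)
```

The single-sum version visits each position once per symbol a_i, whereas the double
sum visits it once per pair of distinct symbols.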

Another way to optimize the efficiency is to restrict the alphabet. Since the analyzed
symbolic space of DNA sequences consists of four symbols, we specialized
the formulas given in (1) and (4) for the alphabet A = {A, G, C, T}.

Parallelizing the calculation of the ad hoc estimator is another effective way of
reducing the running time. The degree $k_{\max}$ of parallelization is the number of
simultaneously calculated values $\hat{r}(k)$ with $k = 1, \ldots, k_{\max}$. Using this
computationally improved algorithm the proposed DAR(p) method can be extended to mid- and
even long-range analysis (values of p in the range of several thousands). Figure 5
displays the run time depending on the degree of parallelization. One observes a
substantial reduction in running time with increasing parallelization. At some value
of $k_{\max}$ the increasing number of variables to be initialized and maintained leads
to a gradual increase of running time at even higher values of $k_{\max}$. At present the
implementation is independent of special hardware and shows the same behavior on
different platforms and operating systems.

Fig. 5 Total running time in seconds, divided by the sequence length in Mbp, that is required on an
Intel PC (Core 2 6600, 2.4 GHz, 1 GB RAM) under Windows XP to determine one of the p values
$\hat{r}(k)$ using parallelization of degree $k_{\max}$ with $p > k_{\max}$. Horizontal axis: degree
of parallelization
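
One possible reading of this parallelization is a single sweep over the sequence in
which the counters for all distances k = 1, ..., k_max are updated together. The Python
sketch below illustrates that idea only; it is not the original implementation, and
the function name, the data layout and the test sequence are assumptions. It also
assumes that the sequence contains only symbols of the given alphabet.

```python
from typing import List


def pair_counts(seq: str, k_max: int, alphabet: str = "ACGT") -> List[List[int]]:
    """Counters of formula (4) for all distances 1..k_max in one sweep.

    counts[k][i] is the number of positions l with X_l != a_i and X_{l+k} = a_i,
    so that B_m(k, a_i) = counts[k][i] / (m - k). Row 0 is unused.
    """
    m = len(seq)
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    counts = [[0] * len(alphabet) for _ in range(k_max + 1)]
    for l in range(m):
        x = seq[l]
        # Only distances that stay inside the sequence are visited.
        for k in range(1, min(k_max, m - 1 - l) + 1):
            y = seq[l + k]
            if y != x:
                counts[k][index[y]] += 1
    return counts


toy_sequence = "ACGT" * 5  # hypothetical test sequence of length 20
counts = pair_counts(toy_sequence, 4)
```

Each additional distance adds one row of counters that has to be initialized and
updated, which is in line with the observation that the gain levels off for large
degrees of parallelization.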



4 Conclusion 

From a theoretical perspective, the genome is a dynamically expanding object on an 
evolutionary scale. Statistical properties of DNA sequences help understand charac- 
teristics of genome evolution. Over the last decade short-range correlations in DNA 
sequences have proven quite informative in this respect. In this sense, correlation 
signatures are predominantly process signatures. 

Another motivation for studying statistical properties of DNA sequences is the
task of classifying them according to species affiliation and other biological criteria.
This is particularly important in the emerging discipline of ecological genomics, 
where the spatial and temporal distribution of DNA sequence fragments from large- 
scale sequencing and metagenomics projects is studied. 

While the principal tools of analyzing short-range correlations are not yet ready 
to really contribute to micro-organism classification, we have reported here a range 
of results showing the efficiency of this method in the Eukaryotic domain. 

An automated search for statistical patterns, the classification of these patterns
using clustering methods and the interpretation of the patterns in the context of
genome evolution are the principal aims of our studies. With the present article we
want to make these tools accessible to other contexts of classifying sequence-like
data according to correlation patterns. At the same time we want to highlight some
of the rich and fascinating features that DNA sequences display when analyzed
from this statistical perspective.



On Classification of Molecules and Species
of Representation Rings 



Lothar Häberle



Abstract The classification of molecules by symmetry groups is a well-established 
application of mathematical group theory in natural sciences. Groups and their char- 
acters are thereby used to determine physical properties of molecules. 

We discuss species, a generalization of characters, which is less known outside
of mathematics. Explicit species formulae for certain groups and some theoretical
background are provided with the aim of finding applications, for instance, in chemistry
in the future.

Keywords Classification of molecules • Representation ring • Species • Symmetry
group.



1 Introduction 

In biology and chemistry crystal structures and symmetries of molecules, for exam- 
ple, are classified by mathematical groups. The assigned groups can then be used 
to determine physical properties such as polarity and chirality (see Sects. 2 and 3 
or Atkins & de Paulo, 2006). 

Representations of groups as linear transformations of vector spaces and, more
generally, modules enable many group theoretical problems to be reduced to problems
of linear algebra, which is a well understood theory. Defining addition and
multiplication via direct sum and tensor product on the set of these modules and then
considering them as elements of a ring, the representation ring, is an approach to
examine such modules (see Benson & Parker, 1984; Benson, 1991; Curtis & Reiner,
1981; Feit, 1980). In order to investigate representation rings one may study their
structure-preserving maps to the complex numbers, which are called species. Species
may be regarded as a generalization of characters, a well-established concept in
representation theory with many useful applications, for instance, in natural sciences
(see Sects. 3 and 5).

Species of finite groups G whose largest subgroup P of prime power order for
some prime number is cyclic were recently discussed in Häberle (2008). We apply
results from Häberle (2008) and state explicit species formulae for such groups with
the additional condition that P is normal in G (Theorem 1). Throughout the paper
we illustrate the theoretical statements with examples which relate to the classification
of molecules. Since many groups of small order fall into our category, we hope
for applications of species outside of mathematics in the future.



2 Classification of Molecules by Symmetry Groups 

A symmetry of a geometric object is a transformation that leaves the object
unchanged. For example, the symmetries of an equilateral triangle are rotations
by 0°, 120°, 240° and three reflections in the perpendicular bisectors of the sides,
because these six transformations do not change the appearance of the triangle. The
set of symmetries of any geometric object together with the composition of two
transformations as operation forms a mathematical group, which is called the symmetry
group. The transformation which does nothing (rotation by 0°) is the identity element
of the group.
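
The triangle example can be verified with a few lines of Python. In this sketch each
symmetry is encoded as a permutation of the vertex labels 0, 1, 2; this encoding and
all names are choices made for the illustration and do not appear in the text.

```python
from itertools import permutations


def compose(p, q):
    """Composition 'p after q': first apply q, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


identity = (0, 1, 2)
rotations = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]    # rotations by 0, 120 and 240 degrees
reflections = [(0, 2, 1), (2, 1, 0), (1, 0, 2)]  # each reflection fixes one vertex
symmetries = set(rotations + reflections)

# Group axioms that can be checked by enumeration (associativity holds for maps).
closed = all(compose(p, q) in symmetries for p in symmetries for q in symmetries)
has_inverses = all(
    any(compose(p, q) == identity for q in symmetries) for p in symmetries
)
```

Every permutation of the three vertices is realized by one of the six symmetries, so
the set coincides with the full set produced by `permutations(range(3))`.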

In physical chemistry molecules are considered as three-dimensional geometric
objects in order to classify them by identifying all their symmetries and grouping
together molecules that possess the same set of symmetries. This procedure,
for example, puts the molecules NH3 and POCl3 into one class and the
molecules H2O and SO2Cl2 into another class. There are five kinds of symmetries
of three-dimensional objects and molecules, respectively:

• The identity id does nothing

• The n-fold rotation $\sigma_n$ rotates through (360/n)°

• The reflection in a horizontal ($\tau_h$), vertical ($\tau_v$) and dihedral ($\tau_d$) mirror plane

• The inversion $\tau_c$ through the center of the object

• The n-fold improper rotation is an n-fold rotation followed by a reflection
through a plane perpendicular to the axis of that rotation

We will not go into detail on how a systematic classification works but refer to Atkins
and de Paulo (2006). In Table 1 we present small symmetry groups and for each
group a few examples of molecules. The symmetry elements of a group and the
group generators are listed. Each symmetry can be obtained by compositions of
generators. For instance, a three-fold rotation $\sigma_3$ and a reflection in a vertical plane
$\tau_v$ generate $C_{3v}$, the symmetry group of an equilateral triangle (see above) and of
the molecules NH3 and POCl3, respectively. The other reflections of $C_{3v}$ are the
compositions $\tau_v' = \sigma_3 \tau_v$ and $\tau_v'' = \sigma_3^2 \tau_v$. Generators are used to
describe groups, because they clearly provide all necessary information. Table 1 lists all
groups up to order 6 except $C_5$. The groups $C_n$ with some $n \in \mathbb{N}$ are cyclic,
i.e., they can be generated by only one element: $C_n = \langle \sigma_n \rangle$. The group
$C_{2v}$ is isomorphic to $C_2 \times C_2$, and $C_{3v}$ is the smallest non-abelian group.
Mathematically one does not distinguish between $C_s$, $C_i$ and $C_2$ because these groups
are isomorphic.

Table 1 Examples of small symmetry groups and corresponding molecules

Group      Order  Symmetries                                                     Generators            Examples
$C_1$      1      id                                                             id                    CBrClFI
$C_i$      2      id, $\tau_c$                                                   $\tau_c$              Mesotartaric acid
$C_s$      2      id, $\tau_{h/v/d}$                                             $\tau_{h/v/d}$        SOCl2, C9H7N, C3H6
$C_2$      2      id, $\sigma_2$                                                 $\sigma_2$            H2O2
$C_3$      3      id, $\sigma_3$, $\sigma_3^2$                                   $\sigma_3$            9bH-Phenalene
$C_4$      4      id, $\sigma_4$, $\sigma_4^2$, $\sigma_4^3$                     $\sigma_4$            Calix[4]arene derivative
$C_{2v}$   4      id, $\sigma_2$, $\tau_v$, $\tau_v'$                            $\sigma_2$, $\tau_v$  H2O, SO2Cl2, O3, SF4
$C_{3v}$   6      id, $\sigma_3$, $\sigma_3^2$, $\tau_v$, $\tau_v'$, $\tau_v''$  $\sigma_3$, $\tau_v$  NH3, POCl3, CHCl3
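
How a group is obtained from its generators can be made concrete with a short Python
sketch. The permutation models used here are assumptions made for the illustration:
the generators of the group of order 6 act on the three vertices of a triangle, the
group of order 4 isomorphic to C2 x C2 is modelled by two commuting swaps on four
points, and the cyclic group of order 4 by a single 4-cycle. All names are hypothetical.

```python
def compose(p, q):
    """Composition 'p after q': first apply q, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


def generate(generators, identity):
    """All elements obtainable as compositions of the generators (finite groups)."""
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def is_abelian(group):
    """True if all pairs of elements commute."""
    return all(compose(g, h) == compose(h, g) for g in group for h in group)


# Three-fold rotation and one vertical reflection acting on the vertices 0, 1, 2.
sigma_3 = (1, 2, 0)
tau_v = (0, 2, 1)
c3v = generate([sigma_3, tau_v], (0, 1, 2))

# The two remaining reflections as compositions of the generators.
tau_v1 = compose(sigma_3, tau_v)
tau_v2 = compose(compose(sigma_3, sigma_3), tau_v)

# Two commuting elements of order two as a stand-in for the generators of C2v.
c2v = generate([(1, 0, 2, 3), (0, 1, 3, 2)], (0, 1, 2, 3))

# A single generator of order four yields a cyclic group.
c4 = generate([(1, 2, 3, 0)], (0, 1, 2, 3))
```

The closure contains six elements for the first pair of generators and is not
commutative, whereas the two groups of order four are both commutative but differ in
structure: one needs two generators, the other only one.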

Properties such as polarity and chirality of a molecule can be explained as soon as
its symmetry group has been identified. A polar molecule is one with a permanent
electric dipole moment. Only molecules belonging to the groups $C_n$, $C_{nv}$ for some
n > 1 and $C_s$ may be polar. A chiral molecule is a molecule that cannot be
superimposed on its mirror image. A molecule may be chiral only if it does not possess an
improper rotation. For a detailed discussion see Atkins and de Paulo (2006).

In general, groups are hard to understand, because their inner structures can be
very complex. For instance, familiar rules for calculating with numbers, such as
commutativity, mostly do not hold. In order to simplify group theoretical problems one may
use representation theory, which will be introduced in the next section.



3 Ordinary Representations of Finite Groups 

Let $G$ be an arbitrary finite group. For a finite dimensional vector space $V$ over $\mathbb{C}$,
we consider the group of automorphisms of $V$

$$\operatorname{Aut}_{\mathbb{C}}(V) = \{ f : V \to V \mid f \text{ is } \mathbb{C}\text{-linear and bijective} \}.$$

A group homomorphism

$$\Delta : G \to \operatorname{Aut}_{\mathbb{C}}(V)$$

is called an ordinary representation of $G$. A representation is said to be irreducible if
$\{0\}$ and $V$ are the only subspaces $W$ of $V$ with $\Delta(G)(W) = W$. By Maschke's
theorem, each representation can be decomposed into a direct sum of irreducible
representations. So, irreducible representations can be seen as basic building blocks
of all representations in much the same way as prime numbers are the basic building
blocks of the natural numbers. Since bijective linear maps and invertible matrices
correspond, one may define a representation of $G$ via the general linear group of
$n \times n$ matrices over $\mathbb{C}$:

$$\Delta : G \to GL(n, \mathbb{C})$$

with $n = \dim_{\mathbb{C}} V$. Each element of $G$ is represented as a matrix. Since $\Delta$ is a
structure preserving map, $\Delta$ transfers the inner structure of $G$ to a set of matrices. In
order to study groups, one may study matrices, and the study of matrices belongs
to the well-understood theory of linear algebra. The map

$$\chi : G \to \mathbb{C}, \quad g \mapsto \operatorname{trace} \Delta(g)$$

is called the (ordinary) character of $G$. Characters are class functions, that is, they
take a constant value on a given conjugacy class. The characters of irreducible
representations are listed in character tables with representatives of conjugacy classes
as header. Characters are the starting point of a highly developed theory. We refer
to Robinson (1982) for further reading. Tables 2 and 3 show the character tables of
the symmetry groups from Table 1.
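
The definition of a character as a trace can be made concrete with a small sketch. It assumes one particular two-dimensional matrix representation of $C_{3v}$ (rotation by 120 degrees, reflection in the x-axis); these matrices and all names in the code are illustrative choices, not taken from the text. The traces reproduce the row (2, -1, 0) of Table 3 and show that elements of the same conjugacy class share a character value.

```python
import math

def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(m):
    """Trace of a 2x2 matrix."""
    return m[0][0] + m[1][1]

# Assumed matrices: rotation by 120 degrees and reflection in the x-axis.
angle = 2 * math.pi / 3
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
SIGMA_3 = [[math.cos(angle), -math.sin(angle)],
           [math.sin(angle), math.cos(angle)]]
TAU_V = [[1.0, 0.0], [0.0, -1.0]]

# Character values on one representative per conjugacy class of C_3v.
character = {
    "id": trace(IDENTITY),
    "sigma_3": trace(SIGMA_3),
    "tau_v": trace(TAU_V),
}

# Class function property: other members of a class give the same trace.
sigma_3_squared = matmul(SIGMA_3, SIGMA_3)   # second rotation
tau_v_prime = matmul(SIGMA_3, TAU_V)         # another reflection
```

Up to floating point rounding, `character` holds the values 2, -1 and 0.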

In our chemical application, the characters of irreducible representations stand
for orbitals that belong to the various atoms in a molecule, and the symmetry
characteristics of orbitals can be explained with the character table. For instance, the
entry $+1$ shows that an orbital represented by $\chi_1$ remains the same, and the entry $-1$
shows that the orbital $\chi_2$ changes sign under the symmetry $\tau_v$ of the group $C_{3v}$ (see
Table 3). The character value of id gives the degeneracy of the orbitals. In $C_{3v}$ the
orbitals $\chi_1$ and $\chi_2$ are non-degenerate, the orbital $\chi_3$ is doubly degenerate, and
there are no triply degenerate orbitals, because $\chi_i(\mathrm{id}) \neq 3$ for all $i$.
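
The degeneracies can be read off mechanically from the character table. The following sketch uses the values of Table 3 together with the class sizes 1, 2 and 3 of $C_{3v}$ (identity, rotations, reflections), which are a standard fact not stated in the text; the variable names are illustrative. It also checks the orthogonality of the irreducible characters.

```python
from fractions import Fraction

# Conjugacy classes of C_3v: {id}, two rotations, three reflections.
CLASS_SIZES = [1, 2, 3]

# Character table of C_3v (Table 3); columns: id, sigma_3, tau_v.
TABLE = {
    "chi_1": [1, 1, 1],
    "chi_2": [1, 1, -1],
    "chi_3": [2, -1, 0],
}

def inner(a, b):
    """Inner product of two real-valued class functions."""
    total = sum(n * x * y for n, x, y in zip(CLASS_SIZES, a, b))
    return Fraction(total, sum(CLASS_SIZES))

# The value on id is the degeneracy of the corresponding orbital.
degeneracy = {name: row[0] for name, row in TABLE.items()}

# Irreducible characters are orthonormal with respect to inner().
gram = {(a, b): inner(TABLE[a], TABLE[b]) for a in TABLE for b in TABLE}
```

Here `degeneracy` contains no value 3, which is the observation that there are no triply degenerate orbitals.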

In a certain sense ordinary representations are special cases of modular
representations, which are dealt with in the next section.



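Table 2 Character tables of groups up to order 4. Denotations: $i \in \mathbb{C}$ with $i^2 = -1$; $\zeta \in \mathbb{C}$ with
$\zeta^3 = 1$ and $\zeta \neq 1$

$C_2$      id   $\sigma_2$
$\chi_1$    1    1
$\chi_2$    1   -1

$C_3$      id   $\sigma_3$   $\sigma_3^2$
$\chi_1$    1    1           1
$\chi_2$    1    $\zeta$     $\zeta^2$
$\chi_3$    1    $\zeta^2$   $\zeta$

$C_4$      id   $\sigma_4$   $\sigma_4^2$   $\sigma_4^3$
$\chi_1$    1    1            1             1
$\chi_2$    1    $i$         -1            $-i$
$\chi_3$    1   $-i$         -1             $i$
$\chi_4$    1   -1            1            -1

$C_{2v}$   id   $\sigma_2$   $\tau_v$   $\tau_{v'}$
$\chi_1$    1    1            1          1
$\chi_2$    1    1           -1         -1
$\chi_3$    1   -1            1         -1
$\chi_4$    1   -1           -1          1

Table 3 Character table of $C_{3v}$

$C_{3v}$   id   $\sigma_3$   $\tau_v$
$\chi_1$    1    1            1
$\chi_2$    1    1           -1
$\chi_3$    2   -1            0


4 Modular Representations of Finite Groups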

Again, let $G$ be a finite group. From now on, we fix a prime $p$ and an algebraically
closed field $k$ of characteristic $p$ (see an example in Bradley, 2007).
The $k$-linear combinations of elements of $G$ form a ring which is denoted by $kG$.
Closely related to ordinary representations, a homomorphism

$$\Delta : G \to \operatorname{Aut}_k(M)$$

is said to be a modular representation of $G$ over $k$, where $M$ is a finite dimensional
vector space over $k$. The vector space $M$ becomes a $kG$-module via

$$g \cdot m := \Delta(g)(m), \quad g \in G,\ m \in M.$$

Modules are generalized vector spaces: the two definitions agree except that the
scalars of a vector space are taken from a field, whereas the scalars of a module are
taken from a ring. Since fields are rings, vector spaces are modules. Direct sums
and tensor products of modules are defined in the same way as direct sums and tensor
products of vector spaces. A module $M$ is called indecomposable if $M = M_1 \oplus M_2$
implies $M_1 = 0$ or $M_2 = 0$. In contrast to indecomposable vector spaces,
indecomposable modules are not necessarily one-dimensional. For $kG$-modules $M$
and $N$, the decomposition of their tensor product

$$M \otimes_k N \cong M_1 \oplus M_2 \oplus \cdots \oplus M_r$$

into indecomposable $kG$-modules $M_i \neq 0$ is unique up to isomorphism
(Krull-Schmidt theorem). Higman showed that there are finitely many isomorphism
classes of indecomposable $kG$-modules if and only if $G$ contains a cyclic Sylow
$p$-subgroup, i.e., a subgroup $P \leq G$ which is cyclic and of maximal $p$-power order
(see Pierce, 1982, Theorem 10.8).

Representations of $G$ over $k$ and $kG$-modules are in one-to-one correspondence.
Moreover, similar representations and isomorphic $kG$-modules correspond. Thus there is
a bijection between similarity classes of representations of $G$ over $k$ and isomorphism
classes of $kG$-modules. For that reason one may study $kG$-modules instead of
modular representations (see Curtis & Reiner, 1981; Feit, 1980).

From now on we assume that $G$ contains a cyclic Sylow $p$-subgroup and denote
the isomorphism classes of indecomposable modules by $M_1, \ldots, M_n$. For
convenience we will not distinguish between isomorphism classes of modules and the
modules themselves. We consider the (isomorphism classes of) indecomposable
modules as basis elements of a $\mathbb{C}$-vector space $A(kG)$. An arbitrary $kG$-module is regarded
as an element of $A(kG)$, namely as the linear combination of indecomposable modules
given by its decomposition into indecomposable modules. The product
of two elements of $A(kG)$ is defined by the decomposition of the tensor product
into indecomposable modules. Since these decompositions are unique up to
isomorphism by the above mentioned Krull-Schmidt theorem, the multiplication is
well-defined and $A(kG)$ becomes a ring which is called the representation ring (or
Green ring). The ring $A(kG)$ is semi-simple, i.e., $A(kG)$ splits into a direct sum of
one-dimensional subrings

$$A(kG) = \mathbb{C}e_1 \oplus \cdots \oplus \mathbb{C}e_n \cong \mathbb{C}^n \qquad (1)$$

which are generated by idempotents $e_1, \ldots, e_n$ (see Benson, 1991; Feit, 1980; Curtis
& Reiner, 1981). If one knows the ring isomorphism $A(kG) \to \mathbb{C}^n$, one may work
with the familiar ring $\mathbb{C}^n$ instead of $A(kG)$.



5 Species of Representation Rings 

A $\mathbb{C}$-linear ring homomorphism $s : A(kG) \to \mathbb{C}$ with $s \neq 0$ is called a species.
For an idempotent $e \in A(kG)$ one has $s(e)^2 = s(e^2) = s(e)$, hence $s(e) = 0$ or
$s(e) = 1$. It is easy to see that the decomposition (1) implies that there are exactly
$n$ species $s_1, \ldots, s_n$ for groups with cyclic Sylow $p$-groups. Without loss of generality
we may assume

$$s_i(x) = r_i, \quad x = r_1 e_1 + \cdots + r_n e_n \in A(kG).$$

Then the map

$$A(kG) \to \mathbb{C}^n, \quad x \mapsto (s_1(x), \ldots, s_n(x))$$

is the sought $\mathbb{C}$-linear ring isomorphism (see Benson, 1991), and a $kG$-module is
uniquely determined by all $n$ species values. Species simplify working with
modules. For instance, the decomposition of a tensor product may be reduced to solving
linear equations. As characters are listed in character tables, species are recorded in
species tables with indecomposable modules as header.

The species of cyclic $p$-groups have long been known (see Green, 1962).
They can be defined inductively starting with the ones of a group of prime order. The
species values are real numbers which can be expressed as linear combinations of
complex roots of unity. The indecomposable modules are completely characterized
by their dimensions as vector spaces over $k$ and are here denoted $T_j$. Table 4 shows
the species of cyclic groups up to order 4.
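
The remark that a tensor product can be decomposed by solving linear equations can be illustrated with the cyclic group $C_3$ and $p = 3$. The sketch below assumes the species values 1, 2, 3 (the dimensions), 1, -1, 0 and 1, 1, 0 on $T_1, T_2, T_3$; since a species is a ring homomorphism, the species values of $T_i \otimes T_j$ are the products of the values of $T_i$ and $T_j$, and the multiplicities of the indecomposable summands solve a linear system. The function names are hypothetical.

```python
from fractions import Fraction

# Assumed species table of C_3 for p = 3; rows: species, columns: T_1, T_2, T_3.
SPECIES_C3 = [
    [1, 2, 3],
    [1, -1, 0],
    [1, 1, 0],
]

def solve(matrix, rhs):
    """Solve a regular square linear system exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    rows = [[Fraction(v) for v in row] + [Fraction(r)]
            for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

def tensor_multiplicities(i, j):
    """Multiplicities of T_1, T_2, T_3 in the tensor product of T_i and T_j."""
    product_values = [row[i - 1] * row[j - 1] for row in SPECIES_C3]
    return solve(SPECIES_C3, product_values)
```

For example, `tensor_multiplicities(2, 2)` yields the multiplicities 1, 0, 1, i.e., $T_2 \otimes T_2 \cong T_1 \oplus T_3$, whose dimension is $4 = 1 + 3$ as expected.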

The species of arbitrary finite groups with a cyclic Sylow $p$-group were recently
discussed in Haberle (2008). The species formulae there, however, cannot be applied
without a deeper understanding of modular representation theory. Here we will
state explicit species formulae for groups whose Sylow $p$-groups are both cyclic
and normal.

Let $G$ be a finite group with a normal cyclic Sylow $p$-group $P \trianglelefteq G$. By the
Schur-Zassenhaus theorem $G = P \rtimes C$ for some $C \leq G$, i.e., $G = \{ac \mid a \in
P,\ c \in C\}$. An indecomposable $kG$-module then only depends on an irreducible
character of $C$ and on the module's dimension as a vector space over $k$ (see Curtis &
Reiner, 1981; Haberle, 2008; Robinson, 1982):

Table 4 Species tables of cyclic groups up to order 4

$C_2$    $T_1$  $T_2$
$t_1$     1      2
$t_2$     1      0

$C_3$    $T_1$  $T_2$  $T_3$
$t_1$     1      2      3
$t_2$     1     -1      0
$t_3$     1      1      0

$C_4$    $T_1$  $T_2$  $T_3$  $T_4$
$t_1$     1      2      3      4
$t_2$     1      2      1      0
$t_3$     1      0     -1      0
$t_4$     1      0      1      0



$$M_{Q,\varphi,j}, \qquad 1 \leq Q \leq P,\ \varphi \in \operatorname{Irr}(C),\ 1 \leq j \leq |Q|,\ p \nmid j.$$

The dimension of $M_{Q,\varphi,j}$ is $(P : Q)j$ with $(P : Q) := |P|/|Q|$. Before
we state the main result of this paper, we have to introduce some functions on
$C$. The quotient $G/C_G(P)$, where $C_G(P)$ denotes the centralizer of $P$ in $G$, is
a cyclic group whose order $e$ divides $p - 1$ (see Feit, 1980, Chap. VII). Hence
$G/C_G(P) = \langle \tau C_G(P) \rangle$ for some $\tau \in C$. Let $\xi \in \mathbb{C}$ be a primitive $e$-th root of unity.
An irreducible character $\alpha$ of $C$ of special interest is defined through $\alpha(\tau^i c) := \xi^i$
for $i \in \mathbb{N}$ and $c \in C_G(P) \cap C$. For $c \in C$ with $c \notin C_G(P)$ and $n \in \mathbb{N}$ let

$$\gamma(c, n) := \frac{\alpha(c)^n - 1}{\alpha(c) - 1} = \sum_{i=1}^{n} \alpha(c)^{i-1}$$

if $n > 0$, and $\gamma(c, n) = 0$ if $n = 0$. Then

$$\gamma(c, p^d r) = \gamma(c, r), \quad d \in \mathbb{N},\ 1 \leq r < p, \qquad (2)$$

because $\alpha(c)$ is a $(p - 1)$-th root of unity. Furthermore, let $\beta$ be a complex
valued function on $C$ with $\beta^2 = \alpha$ and $\beta(\tau) = \sqrt{\alpha(\tau)}$ for some square root. One
says a species $t$ of $A(kD)$ with $D \leq P$ has vertex $D$, if $t(T_j) = 0$ for all $j$
which are divisible by $p$. The set of species of $A(kD)$ with vertex $D$ is denoted by
$\operatorname{Sp}(A(kD), D)$. The following theorem specializes (Haberle, 2008, Theorem 4).

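Theorem 1. Let $G$ be a finite group with a normal cyclic Sylow $p$-group $P$ and let

$$Y := \{ (D, c, t) \mid D \leq P,\ c \in C,\ t \in \operatorname{Sp}(A(kD), D) \}.$$

The species of $A(kG)$ are

$$s_{D,c,t} : A(kG) \to \mathbb{C}, \quad (D, c, t) \in Y$$

with

$$s_{D,c,t}(M_{Q,\varphi,j}) =
\begin{cases}
0 & \text{if } Q < D \\
\varphi(c)\,\beta^{j-1}(c)\,t(T_j) & \text{if } Q = D \text{ and } c \notin C_G(P) \\
\varphi(c)\,(P : D)\,t(T_j) & \text{if } Q = D \text{ and } c \in C_G(P) \\
\varphi(c)\,x & \text{if } Q > D \text{ and } c \notin C_G(P) \\
\varphi(c)\,y & \text{if } Q > D \text{ and } c \in C_G(P)
\end{cases}$$

and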

$$x = \gamma(c, n)\,\beta(c)^{[\ldots]}\,t(T_{m+1}) + \gamma(c, (Q : D) - n)\,\beta(c)^{[\ldots]}\,t(T_m)$$
$$y = (P : Q)\,n\,t(T_{m+1}) + \bigl((P : D) - (P : Q)\,n\bigr)\,t(T_m)$$
$$j = (Q : D)\,m + n, \qquad 0 \leq m < |D|, \quad 1 \leq n \leq (Q : D)$$

Proof. Let $s = s_{D,c,t}$ and $M = M_{Q,\varphi,j}$. Then $M$ has vertex $Q$ and $s$ has
vertex $D$. Thus $s(M) = 0$ if $Q < D$. The statements for $Q = D$ and $M_{Q,1,j}$ follow
from Haberle (2008, Theorem 4, Propositions 6 and 12).

Now let $Q > D$. First we work in $A(kP)$ and in representation rings of
subgroups of $P$. With the restriction map and the induction map and by applying
Green's indecomposability theorem and the Mackey formula (see Benson, 1991;
Curtis & Reiner, 1981) one gets

$$\operatorname{res}^P_D T_{(P:Q)j} = \operatorname{res}^P_D \operatorname{ind}^P_Q T_j = (P : Q) \operatorname{res}^Q_D T_j.$$

Write $j = (Q : D)m + n$ with $0 \leq m < |D|$ and $1 \leq n \leq (Q : D)$. Then

$$\operatorname{res}^Q_D T_j = n\,T_{m+1} + ((Q : D) - n)\,T_m$$

(see Green, 1962). We identify $T_{(P:Q)j} \in A(kP)$ with $M_{Q,1,j} \in A(kG)$ by letting
$G$ act on $T_{(P:Q)j}$ as in Haberle (2008). One easily proves that the decomposition
of $\operatorname{res}^G_{\langle D,c \rangle} M_{Q,1,j}$ into indecomposable summands may be concluded from the
decomposition of $\operatorname{res}^P_D T_{(P:Q)j}$ similar to Haberle (2008, Proposition 10). Hence

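$$\operatorname{res}^G_{\langle D,c \rangle} M_{Q,1,j} = \bigoplus_{i=1}^{(P:Q)n} [\ldots] \;\oplus \bigoplus_{i=(P:Q)n+1}^{(P:D)} [\ldots]$$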

with indecomposable $k\langle D, c \rangle$-modules $Z_{\psi,j}$. The statement for $M_{Q,1,j}$ now
follows from Haberle (2008, Theorem 4) and (2). Since the module $M_{P,\varphi,1}$ is
one-dimensional, $\operatorname{res}^G_{\langle D,c \rangle} M_{P,\varphi,1} = Z_{\varphi|_{\langle c \rangle},1}$. Thus $s_{D,c,t}(M_{P,\varphi,1}) = \varphi(c)$ by Haberle
(2008, Theorem 4). Finally, the theorem holds for $M$ and $Q > D$, because
$M = M_{Q,1,j} \otimes M_{P,\varphi,1}$ due to Feit (1980, Lemma VII.2.1). □

Table 5 Species tables for $C_{3v}$ with $p = 3$ (entries not recoverable from the source)

Table 6 Species tables for $C_{3v}$ with $p = 2$

         $M_1$  $M_2$  $M_3$
$s_1$     1      2      2
$s_2$     1      2     -1
$s_3$     1      0      0




Theorem 1 shows that the species of $A(kG)$ are constructed with the species of
the subgroups of $P$, and the irreducible characters of $C$. The following corollary
applies the theorem to the case $|P| = p$ where $D = 1$ and $D = P$ are the only
$p$-subgroups of $G$.

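Corollary 1. Let $G$ be a finite group with a normal Sylow $p$-group $P$ of prime
order. Then

$$s_{D,c,t}(M_{Q,\varphi,j}) =
\begin{cases}
0 & \text{if } D = P \text{ and } Q = 1 \\
\varphi(c)\,\beta^{j-1}(c)\,t(T_j) & \text{if } D = P \text{ and } Q = P \\
\varphi(c) & \text{if } D = 1 \text{ and } Q = 1 \text{ and } c \notin C_G(P) \\
\varphi(c)\,p & \text{if } D = 1 \text{ and } Q = 1 \text{ and } c \in C_G(P) \\
\varphi(c)\,\gamma(c, j) & \text{if } D = 1 \text{ and } Q = P \text{ and } c \notin C_G(P) \\
\varphi(c)\,j & \text{if } D = 1 \text{ and } Q = P \text{ and } c \in C_G(P)
\end{cases}$$
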



The symmetry group $C_{3v}$ fulfils the conditions of the corollary for $p = 3$. Here
$P = \langle \sigma_3 \rangle$ and $C = \langle \tau_v \rangle \cong C_2$. The irreducible characters of $C$ are listed
in Table 2, the species of $P$ are in Table 4. We have $C_G(P) = P$ and hence
$G/C_G(P) = \langle \tau_v P \rangle$ and $\xi = -1$. Thus $\alpha = \chi_2$ and $\beta(\tau_v)$ is a square root of $-1$. Now the
corollary can be applied to get the species of $C_{3v}$ with $p = 3$ in Table 5. The species
for $p = 2$ in Table 6 are not implied by Theorem 1, because then $P$ is cyclic but
not normal in $G$, and we have to refer to Haberle (2008). The species for $p \neq 2, 3$
coincide with the characters, as will be shown in the next, well-known corollary.
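
As a worked illustration, the cases of Corollary 1 with $D = 1$ can be evaluated for $C_{3v}$ and $p = 3$. The sketch below is restricted to $D = 1$, so that the choice of the square root $\beta$ plays no role; it assumes $\alpha = \chi_2$ and $C_G(P) = P$ as stated above, and all identifiers are hypothetical. It also checks property (2) of $\gamma$ numerically.

```python
P_ORDER = 3  # |P| = p = 3

# Irreducible characters of C = <tau_v>, isomorphic to C_2 (Table 2).
IRR_C = {
    "chi_1": {"id": 1, "tau_v": 1},
    "chi_2": {"id": 1, "tau_v": -1},
}
ALPHA = IRR_C["chi_2"]    # alpha = chi_2, because xi = -1
IN_CENTRALIZER = {"id"}   # elements of C that lie in C_G(P) = P

def gamma(c, n):
    """gamma(c, n) = sum_{i=1}^{n} alpha(c)^(i-1); the empty sum gives 0."""
    return sum(ALPHA[c] ** (i - 1) for i in range(1, n + 1))

def species_d1(c, q, phi, j):
    """Value of s_{1,c,t} on M_{Q,phi,j} following Corollary 1 (D = 1)."""
    value = IRR_C[phi][c]
    if q == "1":                        # Q = 1, which forces j = 1
        return value * (P_ORDER if c in IN_CENTRALIZER else 1)
    if c in IN_CENTRALIZER:             # Q = P and c centralizes P
        return value * j
    return value * gamma(c, j)          # Q = P and c does not centralize P

# Indecomposable modules M_{Q,phi,j}: Q = 1 with j = 1, Q = P with j = 1, 2.
MODULES = [("1", phi, 1) for phi in IRR_C] + \
          [("P", phi, j) for phi in IRR_C for j in (1, 2)]
ROWS = {c: [species_d1(c, q, phi, j) for q, phi, j in MODULES]
        for c in ("id", "tau_v")}
```

The row for `id` lists the dimensions 3, 3, 1, 2, 1, 2 of the six indecomposable modules, in agreement with the dimension formula $(P : Q)j$.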

Corollary 2. Let $G$ be an arbitrary finite group. If $p$ does not divide the order of
$G$, then species and characters coincide.

Proof. If $p \nmid |G|$, then $P = 1$ and $P$ is normal in $G$. The representation ring $A(kG)$
is generated by one-dimensional modules which are uniquely determined by an
irreducible character. The species only depend on some element $c \in G$ and one
has $s_c(\varphi) = \varphi(c)$ for $\varphi \in \operatorname{Irr}(G)$. Two species $s_c$ and $s_{c'}$ coincide if and only
if $c =_G c'$, i.e., $c$ and $c'$ are conjugate in $G$. Hence the species table with representatives
of conjugacy classes as row names and irreducible characters as column names is a
transposed character table. □

The corollary says that modular representation theory generalizes ordinary
representation theory, where both representation theories coincide if the characteristic
of the field $k$ does not divide the order of $G$.



6 Conclusions 

The application of group theory and ordinary representation theory with charac- 
ters as main concept is well established in natural sciences, whereas the modular 
representation theory is less known outside of mathematics. 

Our aim is to popularize species, the equivalent of characters, such that one day
chemists, for example, read properties of molecules from species tables as from
character tables. The explicit species formulae in Theorem 1 for groups with cyclic
and normal Sylow subgroups show that in this case new species can be easily
constructed from species of subgroups which are already known and from characters,
with the help of some basic group theoretical results.

The nice property that a $kG$-module is uniquely determined by a few complex
numbers, its species values, does not hold for arbitrary finite groups (see Benson
& Carlson, 1986). Groups with non-cyclic Sylow subgroups yield infinitely many
indecomposable modules and infinite dimensional representation rings, respectively,
which are more difficult to understand than the cyclic Sylow case. In the
non-cyclic Sylow case, one therefore studies finite dimensional subrings and quotient
rings (see, e.g., Benson, 1991; Boltje & Külshammer, 2006; Muller, 2008). In each
case, species can be obtained from already known species of subgroups with the
restriction map (see Curtis & Reiner, 1981; Benson, 1991).
The Precise and Efficient Identification
of Medical Order Forms Using Shape Trees



Uwe Henker, Uwe Petersohn, and Alfred Ultsch 



Abstract A powerful and flexible technique to identify, classify and process
documents using images from a scanning process is presented. The types of documents
can be described to the system as a set of differentiating features in a case base using
shape trees. The features are filtered and abstracted from an extremely reduced
scanner image of the document. Classification rules are stored with the cases to enable
precise recognition and subsequent mark reading and Optical Character Recognition
(OCR) processing. The method is implemented in a system which actually processes
the majority of requests for medical lab procedures in Germany. A large
practical experiment with data from practitioners was performed. An average of 97% of
the forms were correctly identified; none were identified incorrectly. This meets the
quality requirements for most medical applications. The modular description of the
recognition process allows for a flexible adaptation to future changes to the form
and content of the document's structures.

Keywords Case based reasoning · Document identification · Document processing
· Optical mark reading · Segmentation · Shape tree · Similarity.



1 Introduction 

Identifying documents on the basis of scanned images is a frequently used process 
in image processing. Particularly in document management systems (DMS) it is a 
basic prerequisite. 

Although it is possible to gather information by analyzing the raster graphic,
efficiently identifying or ascertaining similarity is difficult without first assigning
unique characteristics. A possible approach to identification without unique
identifiers is to abstract the image. This makes it possible to distinguish the regions that
form the characteristics. Depending on the degree of abstraction, the image's layout
can be determined. Hierarchies can also be employed to describe the regions.
This article describes the identification of documents on the basis of abstracted
images. The approach includes:

1. Image pre-processing is necessary for successful identification to create a
suitable starting point for the comparison.

2. Shape trees can save the distinctive characteristics that describe the layout
sufficiently.

3. Using Case-Based Reasoning (CBR), the cases saved as shape trees can be
accessed. This makes it possible to search efficiently for similar documents and
thus to retrieve the case learned a priori.



2 Geometrical Shapes for Determining Similarity

2.1 Object Recognition

The general problem model consists of object recognition, which seeks to detect 
the presence of a known object (here a document) in a new image. Object (class) 
recognition is basically a classification problem: assign a class label to an input 
vector. Some techniques are described in Ferrari, Fevrier, Jurie, and Schmid (2008); 
Epshtein and Ullman (2005); and Lowe (2004) and further methods in Comaniciu 
and Meer (2002); Nister and Stewenius (2006); Lowe (1999); Leibe and Schiele 
(2003); and Leibe (2004). 

We present a hierarchical description of objects and associations. With case- 
based reasoning techniques and instance-based learning the object class recognition 
can be regarded as a classification problem, where the class is predicted by means 
of the query image representation. 



2.2 Shapes as Models for Regions 

If regions can be modeled using geometrical shapes such as rectangles or polygons,
then such a model can also be used to more precisely calculate similarity according
to a degree of abstraction (Lunze, 2005). Shapes in different regions can have the
following relationships between each other:

• One region's shape can contain another.

• The shapes can partially overlap.

• The shapes can touch at one or more points.

• The shapes can be disjoint, in which case a minimum distance can be set for
how far apart two shapes can be.

If producing various shapes for a domain and identifying them with similar
regions is possible, then it follows that it is possible to differentiate the similarity
between two shapes as follows:

• One shape contains the other.

• The shapes overlap.

• The shapes are not more than a set distance apart from each other.

A similarity definition, which can be calculated for any two shapes by a similarity
function $SIM_{shapes}$, can lead to the following results:

$$SIM_{shapes}(s_1, s_2) = (\text{distance}, \text{contains}, \text{matches}, \text{touches}).$$

The values represent the following: 

• Distance: the distance between two shapes 

• Contains: one shape contains the other 

• Matches: the shapes overlap 

• Touches: the shapes touch 
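
A similarity function of this kind can be sketched for the simplest case of axis-aligned rectangles. The class `Rect`, the function names and the exact conventions (for instance that "touches" means distance zero without overlapping interiors) are assumptions made for this illustration and are not taken from the system described here.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle with x0 <= x1 and y0 <= y1."""
    x0: float
    y0: float
    x1: float
    y1: float

def distance(a, b):
    """Euclidean gap between two rectangles (0 if they meet or overlap)."""
    dx = max(b.x0 - a.x1, a.x0 - b.x1, 0.0)
    dy = max(b.y0 - a.y1, a.y0 - b.y1, 0.0)
    return math.hypot(dx, dy)

def contains(outer, inner):
    """True if inner lies completely within outer."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)

def overlaps(a, b):
    """True if the interiors of the rectangles intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def sim_shapes(a, b):
    """Return the tuple (distance, contains, matches, touches)."""
    gap = distance(a, b)
    matches = overlaps(a, b)
    touches = gap == 0.0 and not matches
    return (gap, contains(a, b) or contains(b, a), matches, touches)

frame = Rect(0, 0, 4, 4)
inner = Rect(1, 1, 2, 2)
neighbour = Rect(4, 0, 6, 4)
far_away = Rect(7, 8, 9, 9)
```

With these rectangles, `sim_shapes(frame, far_away)` reports a distance of 5.0 and no other relation, while `frame` and `neighbour` only touch.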



2.3 Modeling Regions as a Shape Tree 

The aim is to produce regions modeled by shapes as a tree to efficiently search 
for similarities. It should thus be possible to find similar entries for a given query 
shape. To simplify the process, each shape is assigned a minimum bounding box 
that contains the shape. In an initial step, this makes calculating the distance easier, 
as only the distance between the two rectangles is calculated. Only when it has been 
ascertained that both rectangles are within a defined area, are the original shapes 
used. Here, the definition of shapes and forms is by no means limited to primitive 
object models. Using a large number of descriptors is possible. For example, results 
have been achieved using edge detectors in Abmayr (1994) and Rosenfeld (2006) 
and Fourier descriptors in Persoon and Fu (1977) and Rosenfeld (2006). 



2.4 Shape Tree Structure 

The following discussion is limited to simple primitive forms as a starting point. 
Although, as already mentioned, a multitude of definitions is possible, for the pur- 
poses of illustration the shapes used here are simple. Figure 1 shows six shapes, 
some of which overlap or are contained in some other shape. Figure 2 shows the 
corresponding shape tree. 

The top node contains the rectangle F, which contains all the remaining shapes.
The shapes A, B, C and E, which are contained in F but in no other shape, follow
as children of the root node.

In contrast, shape D is subordinate to C because the triangle D is located in the
circle C. The shape tree is not height balanced. However, a degree of balance is
possible by limiting a node to a certain maximum number of children. By splitting
the children nodes, the maximum number of child nodes can be maintained.

Fig. 1 Shapes in different geometric relations

Fig. 2 Shape tree of objects in Fig. 1
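
The construction of such a tree from containment relations can be sketched as follows. The coordinates are hypothetical bounding boxes standing in for the shapes A to F of Fig. 1, and the rule "attach each shape to the smallest shape that contains it" is an assumption that reproduces the tree described above; it is not necessarily the procedure used in the implemented system.

```python
# Hypothetical bounding boxes (x0, y0, x1, y1) standing in for Fig. 1.
BOXES = {
    "F": (0, 0, 100, 100),
    "A": (5, 5, 25, 25),
    "B": (20, 20, 40, 40),   # overlaps A
    "C": (50, 50, 90, 90),
    "D": (60, 60, 80, 80),   # lies inside C
    "E": (10, 60, 30, 80),
}

def contains(outer, inner):
    """True if the box inner lies completely within the box outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def build_tree(boxes):
    """Attach every shape to the smallest shape containing it.

    Assumes a single outermost shape and no two identical boxes.
    """
    children = {name: [] for name in boxes}
    root = None
    for name, box in boxes.items():
        parents = [other for other, other_box in boxes.items()
                   if other != name and contains(other_box, box)]
        if not parents:
            root = name
        else:
            parent = min(parents, key=lambda other: area(boxes[other]))
            children[parent].append(name)
    return root, children

root, children = build_tree(BOXES)
```

Here F becomes the root with the children A, B, C and E, and D becomes the only child of C.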

The following formal expression can be used to represent a shape tree's data
structure:

• Node: $f = (f_n, fS_n)$, where $f_n$ is the name of the node and $fS_n$ is a set
of attributes.

• Attribute: $s = (s_n, sFct)$, $s \in fS$, where $s_n$ is the attribute name and $sFct$ is the
number of facets.

A node in a shape-tree, which describes a region, must have the following 
attributes as a minimum: 

- Shape: A geometrical shape, which models the region in question. 

- Bounding box: A geometrical shape (here, a rectangle), which contains the 
model shape. 

- Semantic: A semantic description of the region. 

- Neighborhood distance: Shapes within this distance are considered as neigh- 
boring and therefore as similar. 

• Facet: $Fct = (Fct_n, FctEINTR)$, where $Fct_n$ is the name (type) of the facet and
$FctEINTR$ is the number of entries represented.

Different facet types make a specific representation of a shape tree's knowledge
elements possible within the context of the attribute characteristics. Apart from the
actual attribute characteristics, default values or methods, etc., can be saved for
further processing.
In constructing a shape tree, the following definitions and characteristics must be
considered:

1. A shape tree's node $f = (f_n, fS_n)$ defines a classifier.

A classifier $e(x, f)$ for a set $M$ is a mapping $e : M \to I$ with
$I = \{0, 1\}$, $P = \{x \mid e(x, f) = 1\}$ (set of positive elements) and
$N = \{x \mid e(x, f) = 0\}$ (set of negative elements).

2. A node $f_1$ in shape tree $K$ is more general than a node $f_2$ ($f_1 \geq f_2$) iff for all
$x \in M$: if $e_2(x, f_2) = 1$, then also $e_1(x, f_1) = 1$.

3. The shape tree must always be constructed in such a way that $f \geq f.\mathrm{child}$ is
true.


2.5 Searching in a Shape Tree

The following describes the method for efficiently searching for similar shapes in a
shape tree.

Prior to searching for shapes, the criteria for deciding whether two shapes are
similar must be set. Namely, a similarity interpretation function based on the displayed
similarity description must be defined:

$$SRI_{f_i} : SR \to I, \quad I = \{\text{identical}, \text{similar}, \text{not similar}, \text{ambiguous}\}$$

Searching begins with the inspection of the node's bounding box. If the
inspection is successful, that is, the shapes are similar, the model shape is verified, as
long as the node is of the semantic type. If both the model shape and the query shape
meet the criteria of the comparison, the node with the model shape having semantic
value becomes part of the solution pool of similar shapes. The children of the current
node are then recursively inspected.

If the query criteria are not met, the child nodes do not require any further
attention. This follows from the shape tree definition: the shapes of the subordinate nodes
are contained within the bounding box of the current node.
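
The search procedure can be sketched as a recursive traversal that prunes a whole subtree as soon as the bounding box test fails. The node class, the tolerance parameter and the corner-based shape comparison are assumptions made for illustration; in the described method the model shapes may be compared with more elaborate descriptors.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ShapeNode:
    name: str
    box: tuple                  # bounding box (x0, y0, x1, y1)
    semantic: bool = True       # only semantic nodes can enter the solution pool
    children: list = field(default_factory=list)

def box_distance(a, b):
    """Euclidean gap between two boxes (0 if they meet or overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return math.hypot(dx, dy)

def shapes_similar(a, b, tolerance):
    """Assumed shape test: all corner coordinates differ by at most tolerance."""
    return max(abs(p - q) for p, q in zip(a, b)) <= tolerance

def search(node, query, tolerance, hits=None):
    """Collect the names of semantic nodes whose shape is similar to the query."""
    if hits is None:
        hits = []
    if box_distance(node.box, query) > tolerance:
        return hits             # children lie inside this box: prune the subtree
    if node.semantic and shapes_similar(node.box, query, tolerance):
        hits.append(node.name)
    for child in node.children:
        search(child, query, tolerance, hits)
    return hits

# Hypothetical tree with the layout of Fig. 2.
TREE = ShapeNode("F", (0, 0, 100, 100), semantic=False, children=[
    ShapeNode("A", (5, 5, 25, 25)),
    ShapeNode("B", (20, 20, 40, 40)),
    ShapeNode("C", (50, 50, 90, 90),
              children=[ShapeNode("D", (60, 60, 80, 80))]),
    ShapeNode("E", (10, 60, 30, 80)),
])
```

A query box close to D, such as `(61, 59, 79, 81)` with tolerance 3, visits C and D only; the subtrees below A, B and E are never inspected.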

The comparison of the individual shapes after successfully testing the distance
between the bounding boxes must consist of two components. On the one hand, the
distance between the forms with regard to their length is determined via a Euclidean
distance measure and, on the other, the forms themselves are compared. Comparing Fourier
descriptors is useful here if the shapes are more complex. In the case of simpler
forms, simpler comparison mechanisms can be applied. For complex forms other
methods of pattern matching are suitable.

Disparities within a query image can cause the recognition of an object, for
example one of circular form, to fail, although it is saved at the appropriate location
in the reference tree. In such a case, a more complex form description using Fourier
descriptors is carried out so that a comparison between different description forms
is possible.

Fig. 3 Order form for medical lab tests

Starting from the root node, the tree is traversed to the leaves and a matching
of the respective children nodes is performed. The form similarity is ascertained by
inspecting the type of corresponding object and applying a similarity measure.

It is then possible to compare two cases, and therefore also search the saved
prototype cases.



3 Document Identification of Specialized Order Forms 

Identifying medical order forms poses a special challenge in terms of both the qual- 
ity and quantity of processing. The order forms contain the medical requests as 
markings, which are marked in a defined grid on the form. After identifying the 
form, the processing rule is read as part of the solution and the forms are analyzed 
using special OMR (Optical Mark Recognition) techniques. Coordinates and rules 
for analyzing OCR fields can also be saved as part of the solution. 

Identifying such forms represents a special case of our algorithm. Due to the
design of such forms, only rectangular frames are used in the shape tree. The case
base is saved as a modified shape tree. The tree's nodes each represent a shape (box).
In this specific case, only rectangles are used. These may contain subordinate boxes.
The following definitions are set, which specify the actual application at hand:

• Leaf nodes are boxes.

• Boxes in children nodes must be completely contained within the parent node's
box.

Fig. 4 Reduced copy of a form

In creating the shape tree, the following is true:

• A shape tree is structured by iterative decomposition of the image in the frame.

• Only rectangles are employed due to the requirements of the forms used.

After the necessary pre-processing (deskewing and cropping) the derived shape tree
is compared with the available case base. If a case similar to the current one is not found,
we apply instance-based learning techniques like IBL3.

The cases to be identified must be known to the system a priori. In the present 
application example, returning the most similar case does not make sense because 
it generally differs from the query and, among other things, can lead to incorrect 
results in the semantic interpretation. Similarities in the identification of the forms 
are only allowed when comparing the distinctive interpreted areas. 

For a new query, the search in the case base starts by pre-processing the query
image. A reduced copy of the form is first produced (Fig. 4). Based on this, the
connected areas of the image are determined and a query shape tree is formed using the
line coincidence algorithm discussed above. The latter describes the image's
characteristics sufficiently. Figure 5 shows the identified connected areas from which the
shape tree is formed (Fig. 6).
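
The step from a reduced image to connected areas can be sketched as follows. This is a generic illustration and not the line coincidence algorithm referred to above: the reduction rule (a reduced pixel is dark if any pixel of its block is dark), the 4-neighbourhood and all function names are assumptions.

```python
from collections import deque

def reduce_image(pixels, fx, fy):
    """Reduce a binary image by the factors fx (horizontal) and fy (vertical).

    Assumed rule: a reduced pixel is 1 if any pixel in its block is 1.
    """
    rows, cols = len(pixels), len(pixels[0])
    return [[int(any(pixels[y][x]
                     for y in range(by, min(by + fy, rows))
                     for x in range(bx, min(bx + fx, cols))))
             for bx in range(0, cols, fx)]
            for by in range(0, rows, fy)]

def connected_boxes(grid):
    """Bounding boxes (x0, y0, x1, y1) of the 4-connected areas of 1-pixels."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((c0, r0, c1, r1))
    return boxes

GRID = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
```

The resulting boxes are the rectangles from which a query shape tree could then be assembled by containment.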

The subsequent search in the case base is iterative; in the first step the shape trees'
structures are compared. If no match is found, the second step compares the
leaves according to a defined distance measure. If there is sufficient similarity,
the case is identified and can be further processed according to a set rule.
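
The two-step search can be sketched as follows. Trees are represented as pairs of a box and a list of subtrees; the structural signature (nesting pattern without coordinates), the leaf distance (largest coordinate difference between sorted leaf boxes) and the tolerance are assumptions chosen for illustration, not the measures of the implemented system.

```python
def structure(tree):
    """Canonical nesting pattern of a tree (box, [children]), ignoring coordinates."""
    _, children = tree
    return tuple(sorted(structure(child) for child in children))

def leaves(tree):
    """Boxes of all leaf nodes."""
    box, children = tree
    if not children:
        return [box]
    return [leaf for child in children for leaf in leaves(child)]

def leaf_distance(a, b):
    """Largest coordinate difference between the sorted leaf boxes of two trees."""
    la, lb = sorted(leaves(a)), sorted(leaves(b))
    if len(la) != len(lb):
        return float("inf")
    return max((max(abs(p - q) for p, q in zip(x, y)) for x, y in zip(la, lb)),
               default=0.0)

def identify(query, case_base, tolerance):
    """Step 1: compare tree structures. Step 2: compare leaves by distance."""
    for name, case in case_base.items():
        if structure(case) == structure(query):
            return name
    best = min(case_base, key=lambda n: leaf_distance(query, case_base[n]),
               default=None)
    if best is not None and leaf_distance(query, case_base[best]) <= tolerance:
        return best
    return None                      # no sufficiently similar case

FRAME = (0, 0, 100, 100)
CASE_BASE = {
    "form_1": (FRAME, [((10, 10, 40, 20), []), ((10, 30, 40, 60), [])]),
    "form_2": (FRAME, [((10, 10, 40, 20), []),
                       ((10, 30, 40, 60), [((15, 35, 35, 55), [])])]),
}
QUERY_SAME_STRUCTURE = (FRAME, [((12, 11, 42, 21), []), ((10, 31, 40, 61), [])])
QUERY_SIMILAR_LEAVES = (FRAME, [((5, 5, 45, 25), [((11, 10, 41, 21), [])]),
                                ((10, 30, 40, 60), [((15, 36, 35, 55), [])])])
QUERY_UNKNOWN = (FRAME, [((10, 10, 20, 20), []), ((30, 30, 40, 40), []),
                         ((50, 50, 60, 60), [])])
```

Returning `None` for an unknown layout reflects the requirement that a form is rejected rather than assigned to the merely most similar case.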
Fig. 5 Representation of the image areas




4 Experiments 



In this section, the quality of the presented process is demonstrated in a series
of experiments. The quality of the form identification is influenced by the process
for calculating the shape tree. Different parameters in the reduction of the image
information lead to differing connected regions, and these in turn form the shape
tree. The aim of the experiments was to determine the optimal parameters for the
forms used.

Table 1 Results

Form     Number  Correct-positive  Sensitivity (%)  Specificity (%)
Form 1      785               763               97              100
Form 2      780               759               97              100
Form 3      781               747               96              100
Form 4      117               115               98              100
Total     2,463             2,384               97              100

A total of 2,463 forms of four different types were made available by a North 
German laboratory for the tests. They were filled out by doctors and nurses in 
various institutions for the purpose of ordering medical services. 

The four form types were first added to the case base. Important parameters for 
the identification are the reduction of the image information for forming the shape 
tree and the similarity in comparing the leaf nodes. 

The algorithms presented here were tested using various parameters for the 
reduction. The values for the horizontal and vertical reduction were varied. In the 
first test, 25 forms were used of each of the four form types. The best results were 
achieved with a horizontal reduction of 1:60 and a vertical reduction of 1:40. With 
a higher reduction, that is, with less image information available for the comparison, 
the forms could no longer be distinguished. With a lower reduction, the variation 
between the images was weighted too heavily, and the adjustment of the similarity 
that this made necessary also led to incorrect identifications. 

Using this optimal configuration, the remaining forms were processed. The 
results are displayed in Table 1. 
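
The percentages in Table 1 can be checked directly from the counts. The following is a minimal sketch, assuming that the sensitivity of a form type is the number of correctly identified forms divided by the number of forms of that type; the variable and function names are illustrative and not taken from the original implementation.

```python
# Counts taken from Table 1: (number of forms, correctly identified forms).
results = {
    "Form 1": (785, 763),
    "Form 2": (780, 759),
    "Form 3": (781, 747),
    "Form 4": (117, 115),
}


def sensitivity(total: int, correct: int) -> float:
    """Share of forms of one type that were identified correctly, in percent."""
    return 100.0 * correct / total


per_form = {name: sensitivity(n, c) for name, (n, c) in results.items()}
total_n = sum(n for n, _ in results.values())
total_c = sum(c for _, c in results.values())
overall = sensitivity(total_n, total_c)

for name, value in per_form.items():
    print(f"{name}: {value:.1f}%")
print(f"Total: {total_c} of {total_n} forms = {overall:.1f}%")
```

Rounded to whole percentages, the computed values agree with the table, including the overall rate of 97%.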

Failed preprocessing was determined to be the reason that some forms were not 
recognized. In particular, the labels for alignment were not recognized. Interestingly, 
generally only one of the four corners was not found. Calculating the position of the 
unidentified corner would be useful here for the alignment. 

The implemented algorithm delivered no incorrect results. The characteristic 
elements were grouped and used to form a shape tree. The identification was 
performed in two steps: first by comparing the resulting tree structure and second 
by comparing the formed shapes. The second comparison was carried out only 
when the comparison of the tree structures did not yield a result. The presented test 
demonstrates the efficiency of the approach using four different, but very similar, 
forms. 
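
The two-step identification can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration: the `Node` class, the bounding-box leaf descriptor, the distance measure (mean absolute coordinate difference of sorted leaves) and the threshold `max_distance` are hypothetical stand-ins, since the text does not specify the data structures or the distance measure that were actually used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical leaf descriptor: x, y, width, height


@dataclass
class Node:
    """Hypothetical shape-tree node; only leaves use their box."""
    box: Box = (0, 0, 0, 0)
    children: List["Node"] = field(default_factory=list)


def structure(node: Node) -> tuple:
    """Nested tuple that records only the branching of the tree."""
    return tuple(structure(child) for child in node.children)


def leaves(node: Node) -> List[Box]:
    """Boxes of all leaf nodes, from left to right."""
    if not node.children:
        return [node.box]
    return [box for child in node.children for box in leaves(child)]


def leaf_distance(a: Node, b: Node) -> float:
    """Assumed distance: mean absolute coordinate difference of sorted leaves."""
    boxes_a, boxes_b = sorted(leaves(a)), sorted(leaves(b))
    if len(boxes_a) != len(boxes_b):
        return float("inf")
    diffs = [abs(p - q) for u, v in zip(boxes_a, boxes_b) for p, q in zip(u, v)]
    return sum(diffs) / len(diffs)


def identify(query: Node, case_base: Dict[str, Node],
             max_distance: float = 2.0) -> Optional[str]:
    # Step 1: look for a case with the same tree structure.
    for name, case in case_base.items():
        if structure(case) == structure(query):
            return name
    # Step 2: only reached without a structural match; compare the leaf shapes.
    scored = [(leaf_distance(case, query), name) for name, case in case_base.items()]
    best_distance, best_name = min(scored, default=(float("inf"), None))
    return best_name if best_distance <= max_distance else None


case_base = {
    "form_a": Node(children=[Node((0, 0, 10, 5)), Node((0, 6, 10, 5))]),
    "form_b": Node(children=[Node((0, 0, 10, 5)),
                             Node(children=[Node((0, 6, 4, 5)), Node((5, 6, 5, 5))])]),
}
same_structure = Node(children=[Node((0, 0, 10, 5)), Node((0, 7, 10, 5))])
flat_variant = Node(children=[Node((0, 0, 10, 5)), Node((0, 6, 4, 5)), Node((5, 7, 5, 5))])
unknown = Node((50, 50, 1, 1))

print(identify(same_structure, case_base))
print(identify(flat_variant, case_base))
print(identify(unknown, case_base))
```

The first query is resolved in step 1, the second only in step 2, and the third is rejected because no case is sufficiently similar.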
5 Discussion 

The identification of medical order forms using their layouts was described. The 
forms used here have a clear structure, which enables a high rate of recognition. 
Their use is primarily limited to the marking of orders. Because of this, the forms 
contain little variable data, which would otherwise be a disruptive factor for 
identification. Unlike in Huang, DeMenthon, Doermann, Golebiowski, and Hamilton 
(2005), the analysis does not move from the detail level to structural connections; 
instead, the abstraction of the image serves as the basis for analyzing the layout.* By 
reducing the image information, it is possible to suppress or completely remove 
irrelevant and simultaneously disruptive data in this phase of processing. By heavily 
reducing the image information, the connected areas of the image can be summarized 
and recognized as geometrical shapes. This leads to the forming of the shape tree and 
at the same time justifies its subsequent use. 

The result of 97% correctly classified forms was not expected on the basis of the 
available test data. Apart from the order markings, these contained additional 
handwritten information on various areas of the page, which affected the layout. 
Algorithms that require distinctive characteristics, such as characters or geometrical 
symbols located on set areas of a form, fail if any of these symbols are written or 
glued over or are modified in some other way. The presented approach, however, 
has a high degree of tolerance to changes on the form. 



6 Summary 

This article discussed the identification of documents using the abstraction of a 
created image. The types of documents to be identified must be known to the system 
a priori. For this, the necessary characteristics are saved in a case base as shape trees. 
This file also contains rules for possible further processing. It was shown that, in an 
extremely reduced image, it is possible to filter out the significant, characteristic 
image information and identify it using CBR. The method was demonstrated and 
shown to be effective in experiments using medical order forms. The resolution 
selected for scanning, as well as the parameters for abstracting the form, are key to 
the quality achieved. Based on the achieved results, the approach described here 
offers a solution for identifying documents using their layout where identification 
via conventional elements, such as barcodes, fails. The CBR methods employed 
have proven suitable. 



* In Huang et al. (2005), a process is presented in which documents are compared by their lay- 
out with a collection of already identified documents. A total of 2,555 documents divided into 
18 classes were examined. The results varied, depending on the algorithm and document used, 
between an average of 72% and 85%. 
On the Prognostic Value of Gene Expression 
Signatures for Censored Data 



Thomas Hielscher*, Manuela Zucknick*, Wiebke Werft, and Axel Benner 



Abstract As part of the validation of any statistical model, it is good statistical 
practice to quantify the amount of prognostic information represented by the model; 
this includes gene expression signatures derived from high-dimensional microarray 
data. Several approaches exist for right-censored survival data that measure the gain 
in prognostic information compared to established clinical parameters or biomarkers 
in terms of explained variation or explained randomness. They are either model- 
based or use estimates of the prediction accuracy. 

As these measures differ in their underlying mechanisms, they vary in their 
interpretation, assumptions and properties, in particular in how they deal with the 
presence of censoring. It remains unclear under which conditions and to which 
extent they are comparable. We present a comparison of several common measures 
and illustrate their behaviour in simulation examples and in an application to a real 
gene expression microarray data set. 

Keywords Explained variation • “Large p, small n” • Survival. 



1 Introduction 



One aim of clinical applications of statistical survival analysis is to find clinical and 
patient-characteristic factors that are prognostic of the survival time of a patient. 
In this respect, for clinicians, the desired outcome of survival analysis is the 
development of a prognostic model incorporating the clinical variables that are 
most important for survival prognosis. The prognostic model should be both accurate 
and precise in the prediction of survival outcomes, that is, both the bias and the 
variance of the predictions should be as small as possible. However, it is not obvious 
how one can measure bias and variance in the survival context, or even how they 
should best be defined. Related to this is the question of how one should assess the 
prognostic information gained by a model compared to a reference. We will present 
existing approaches to estimate prediction accuracy in Sect. 2 and the prognostic 
value in Sect. 3, and bring them into context. We will discuss known problems with 
the prognostic value measures and compare them in simulations in Sect. 4. 

Gene expression microarray data are often collected in the context of clinical 
studies with the aim of finding sets of genes that provide a risk stratification of 
the patient population with respect to expected survival rates. Microarray data are 
high-dimensional, with many more variables ($p$) than samples ($n$) available; this 
is known as the “large $p$, small $n$” ($p \gg n$) problem. In this situation, additional 
problems occur, because models have to incorporate regularisation or variable 
selection, and it is even less obvious how the prognostic value of such a regularised 
model should be defined. In Sect. 5, we will present an application of LASSO 
penalised Cox proportional hazards (PH) regression (Tibshirani, 1996) as one 
example of such regularisation methods. 



2 Prediction Accuracy of Survival Models 

It is not obvious how prediction accuracy should be defined in the survival context, 
in contrast to classification problems, where it is usually assessed by the 
misclassification rate $E[I(f(x) \neq y)]$, with $f(x)$ denoting the classifier based on 
input data $x$, which can take the same values as the categorical response $y$. In 
survival analysis, the endpoint is a survival time, which is not always observed 
because of possible censoring. In this context, Graf, Schmoor, Sauerbrei, and 
Schumacher (1999) proposed using the time-dependent Brier score (Brier, 1950) to 
measure prediction accuracy. The Brier score is the mean squared error between the 
observed binary survival status $Y(t) = I(T > t)$ and the predicted survival 
probability $\hat\pi_n(t|x)$ at time $t$: 

$$E_X\big[E[(Y(t) - \hat\pi_n(t|x))^2 \mid X = x]\big] = E[(I(T > t) - \hat\pi_n(t|X))^2]. \tag{1}$$ 

The notation used throughout this article follows Gerds and Schumacher (2006). 
The survival time is $T$; the vector of covariates $X$ has density function $f_X(x)$ and 
corresponding distribution function $F_X(x)$. The true conditional survival function 
is $S(t|x) = P(T > t \mid X = x)$, with corresponding density $f(t|x)$. The marginal 
survival function $S(t) = P(T > t)$ and the marginal density $f(t)$ are defined 
analogously. The marginal and conditional survival states at time point $t$ are $Y(t)$ 
and $Y(t|X)$, respectively. $\mathcal{M}$ is a survival model, which is a set of candidate 
survival functions $\pi$ for $S$. Finally, $\hat\pi_n$ is an estimate of $\pi$ based on $n$ 
observations. 
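
To make the definition concrete, the following minimal sketch computes the empirical Brier score at a fixed time point. It assumes that all survival times are fully observed; with censored data, Graf et al. (1999) weight the contributions to compensate for censoring, which is not shown here. The function name and the example values are hypothetical.

```python
def brier_score(times, predicted_survival, t):
    """Mean squared difference between I(T > t) and the predicted P(T > t | x).

    Assumes uncensored survival times; censoring weights are not handled.
    """
    if len(times) != len(predicted_survival):
        raise ValueError("one prediction per observation is required")
    status = [1.0 if observed > t else 0.0 for observed in times]
    squared_errors = [(y - p) ** 2 for y, p in zip(status, predicted_survival)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data: four patients, evaluated at t = 2.5.
times = [1.0, 2.0, 3.0, 4.0]
informative = [0.1, 0.2, 0.8, 0.9]   # predictions that separate the patients
uninformative = [0.5, 0.5, 0.5, 0.5]  # the same prediction for everyone

bs_model = brier_score(times, informative, t=2.5)
bs_null = brier_score(times, uninformative, t=2.5)
print(bs_model, bs_null)
```

A prediction of 0.5 for every patient gives a score of 0.25, which is the largest value the score can take for a prediction that ignores the covariates and matches the marginal survival probability.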

Based on (1), the expected mean squared error of prediction at time $t$ is 

$$MSE_X(t, \pi, S) = \int E[(I(T > t) - \pi(t|x))^2 \mid X = x]\, f_X(x)\, dx. \tag{2}$$ 

Gerds and Schumacher (2006) show that (2) can be decomposed into a bias and a 
variance term, also called imprecision and inseparability: 

$$\begin{aligned} MSE_X(t, \pi, S) &= \text{imprecision} + \text{inseparability} \\ &= \int (S(t|x) - \pi(t|x))^2 f_X(x)\, dx + \int S(t|x)(1 - S(t|x)) f_X(x)\, dx. \end{aligned} \tag{3}$$ 

If the model is correctly specified ($\pi = S$), then the imprecision term is 0, and 
$MSE_X(t, \pi, S) = \int S(t|x)(1 - S(t|x)) f_X(x)\, dx$. 
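
The decomposition can be verified pointwise in $x$. As a short sketch of the argument (not quoted from Gerds and Schumacher), write $Y(t) - \pi(t|x) = (Y(t) - S(t|x)) + (S(t|x) - \pi(t|x))$ and use the fact that, given $X = x$, the status $Y(t) = I(T > t)$ is a Bernoulli variable with mean $S(t|x)$:

```latex
\begin{aligned}
E\big[(Y(t) - \pi(t|x))^2 \mid X = x\big]
  &= E\big[(Y(t) - S(t|x))^2 \mid X = x\big]
     + 2\,(S(t|x) - \pi(t|x))\,E\big[Y(t) - S(t|x) \mid X = x\big]
     + (S(t|x) - \pi(t|x))^2 \\
  &= S(t|x)\,(1 - S(t|x)) + 0 + (S(t|x) - \pi(t|x))^2 .
\end{aligned}
```

Integrating both sides with respect to $f_X(x)\,dx$ gives the inseparability term and the imprecision term, respectively.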

Graf et al. (1999) proposed the integrated Brier score over the time interval $[0, t^*]$ 
as a single measure of prediction accuracy for the entire interval: 

$$IBS_X(t^*) = \int_0^{t^*} MSE_X(t, \pi, S)\, dW(t) = \int_0^{t^*} (\text{imprecision} + \text{inseparability})\, dW(t), \tag{4}$$ 

where $W(t)$ is a weight function to account for the gradual loss of information over 
time, which results from patients being removed from the risk set either due to 
censoring or because of death. 

Schemper and Henderson (2000) proposed an alternative measure based on the 
mean absolute deviation of the survival status $Y(t|x)$: 

$$D_X(t^*) = \int_0^{t^*} \int 2 S(t|x)(1 - S(t|x)) f_X(x)\, dx\, d\tilde{W}(t) = 2 \int_0^{t^*} \text{inseparability}\; d\tilde{W}(t). \tag{5}$$ 

$\tilde{W}(t)$ is again a weight function to account for the loss of information over time, 
which can be defined differently from $W(t)$. Note that because 
$Y(t|x) \sim \text{Bernoulli}(S(t|x))$, its mean absolute deviation 
$MAD(Y(t|x)) = 2 S(t|x)(1 - S(t|x))$ is exactly twice its variance 
$Var(Y(t|x)) = S(t|x)(1 - S(t|x))$. It follows from (4) and (5) that if $\tilde{W}(t)$ is 
equal to $W(t)$, and if in addition the model is correctly specified ($\pi = S$), then the 
integrated Brier score and the measure of Schemper and Henderson differ only by 
the factor 2. 
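
The factor 2 follows from elementary properties of a Bernoulli variable. Writing $Y$ for $Y(t|x)$ and $S$ for $S(t|x)$, so that $Y = 1$ with probability $S$ and $Y = 0$ with probability $1 - S$, the mean absolute deviation around the mean and the variance are:

```latex
\begin{aligned}
E\,|Y - S| &= S\,|1 - S| + (1 - S)\,|0 - S| = 2\,S\,(1 - S), \\
Var(Y) &= E[Y^2] - S^2 = S - S^2 = S\,(1 - S).
\end{aligned}
```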

Due to different weight functions $W(t)$ and $\tilde{W}(t)$ and different estimation 
methods, numerical deviations can occur between the estimates of $IBS_X(t^*)$ and 
$D_X(t^*)$ even if $\pi = S$. Also, small at-risk set sizes for the largest observed time 
points imply that the method used for integration over time introduces large 
variation in $IBS_X(t^*)$. Hence, in applications throughout this article, $t^*$ is set to 
the 90% quantile of the observed times when computing these measures. 



3 Measuring the Prognostic Value of Survival Models 



In linear regression, the prognostic value of a model can be measured by the 
(adjusted) multiple correlation coefficient 

$$R^2_{adj} = 1 - \frac{MSE}{MST}, \tag{6}$$ 

where the fraction is the ratio of the mean squared error of the regression model 
(MSE) to the total variance of the response variable (MST). For survival data, 
$R^2_{adj}$ cannot be applied in a straightforward way, partly because an $R^2$ measure for 
survival data has to be a function of time and therefore depends on the time range of 
interest and also on the available follow-up time. In addition, survival data are 
subject to censoring, and an $R^2$ measure should be approximately independent of 
censoring, provided that the censoring mechanism itself is independent of the 
failure mechanism. We introduce and compare several existing $R^2$ approaches for 
survival data. Note that by $R^2$ we always refer to the empirical measure estimated 
from the data. 

As both $IBS_X(t^*)$ and $D_X(t^*)$ are interpreted as measures of variability for 
survival data, both have been proposed for constructing $R^2$-like measures for 
survival models: 

$$R^2_{IBS}(t^*) = 1 - \frac{IBS_X(t^*)}{IBS_0(t^*)} \quad \text{and} \quad R^2_D(t^*) = 1 - \frac{D_X(t^*)}{D_0(t^*)}, \tag{7}$$ 

with $IBS_0(t^*)$ and $D_0(t^*)$ being the corresponding measures based on the marginal 
survival distribution $S(t)$ rather than $S(t|X)$. These coefficients give the relative 
gain in prediction accuracy compared to the marginal model. They are referred to as 
measures of explained variation, because they relate the variability remaining in the 
error distribution of the model of interest to the total variability in the marginal 
model. 
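
As an illustration of how such a coefficient is obtained from two prediction error curves, the sketch below integrates a model curve and a marginal (null) curve over a time grid and forms one minus their ratio. It assumes a uniform weight function and trapezoidal integration, which need not match the weighting used by Graf et al. (1999) or Schemper and Henderson (2000); the curve values are hypothetical.

```python
def integrate(times, values):
    """Trapezoidal rule over the grid `times` (uniform weighting is an assumption)."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


def explained_variation(times, model_curve, null_curve):
    """One minus the ratio of integrated model error to integrated null error."""
    return 1.0 - integrate(times, model_curve) / integrate(times, null_curve)


# Hypothetical Brier score curves on a grid from 0 to t* = 3.
grid = [0.0, 1.0, 2.0, 3.0]
model_curve = [0.00, 0.10, 0.15, 0.12]
null_curve = [0.00, 0.20, 0.25, 0.20]

r2 = explained_variation(grid, model_curve, null_curve)
print(round(r2, 3))
```

Because the same weighting is applied to numerator and denominator, any constant normalisation of the weight function cancels in the ratio.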

An alternative approach was proposed by Cox and Snell (1989): 

$$R^2_{LR} = 1 - \left(\frac{L(\hat\beta)}{L(0)}\right)^{-2/n} = 1 - \exp\left(-\frac{2}{n}\big(l(\hat\beta) - l(0)\big)\right), \tag{8}$$ 

where $L(\hat\beta)$ and $L(0)$ are the (partial) likelihoods of the fitted and the null models, 
respectively, and where $l(\hat\beta) = \log L(\hat\beta)$ and $l(0) = \log L(0)$. Advantages of this 
approach include that in linear regression $R^2_{LR}$ is consistent with the ordinary $R^2$, 
that maximum-likelihood estimators maximise $R^2_{LR}$, and that $R^2_{LR}$ is asymptotically 
independent of sample size. A disadvantage is that the interpretation is not clear, 
e.g., it cannot be interpreted as a percentage of variation explained, not least because 
in models other than the linear model, the range [0, 1] is not fully exploited. For this 
reason, Nagelkerke (1991) proposed the following modification 

$$R^2_N = R^2_{LR} / \max(R^2_{LR}) \tag{9}$$ 

to ensure that the range [0, 1] is exploited. However, this does not address another 
disadvantage, namely that $R^2_{LR}$ and $R^2_N$ are negatively correlated with the amount of 
censoring. To improve independence of censoring, O’Quigley, Xu, and Stare (2005) 
proposed replacing the number of observations $n$ by the number of events $e$: 

$$R^2_{OXS} = 1 - \left(\frac{L(\hat\beta)}{L(0)}\right)^{-2/e}. \tag{10}$$ 

As a beneficial side effect, $R^2_{OXS}$ exploits the range [0, 1] more than $R^2_{LR}$, but the 
maximum possible value is still less than 1. 
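
The three likelihood-based coefficients differ only in how the log-likelihood difference is scaled. The sketch below computes them from two log-likelihood values. It assumes the usual form of the Nagelkerke (1991) correction, in which the Cox and Snell coefficient is divided by its maximum $1 - L(0)^{2/n}$; the function names and all numbers are hypothetical.

```python
import math


def r2_cox_snell(loglik_model: float, loglik_null: float, n: int) -> float:
    """Cox and Snell coefficient, scaled by the number of observations n."""
    return 1.0 - math.exp(-2.0 * (loglik_model - loglik_null) / n)


def r2_nagelkerke(loglik_model: float, loglik_null: float, n: int) -> float:
    """Cox and Snell coefficient divided by its maximum, 1 - L(0)^(2/n)."""
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n)
    return r2_cox_snell(loglik_model, loglik_null, n) / max_r2


def r2_events(loglik_model: float, loglik_null: float, n_events: int) -> float:
    """Variant of O'Quigley, Xu and Stare: the number of events replaces n."""
    return 1.0 - math.exp(-2.0 * (loglik_model - loglik_null) / n_events)


# Hypothetical fit: 200 patients, 100 observed events.
ll_null, ll_model = -500.0, -480.0
cs = r2_cox_snell(ll_model, ll_null, n=200)
nk = r2_nagelkerke(ll_model, ll_null, n=200)
oxs = r2_events(ll_model, ll_null, n_events=100)
print(round(cs, 3), round(nk, 3), round(oxs, 3))
```

With half of the observations censored, the event-based coefficient is clearly larger than the other two, while the Nagelkerke correction changes the Cox and Snell value only slightly because its maximum is close to 1.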

Kent and O’Quigley (1988) use the Kullback-Leibler distance between the 
model of interest and the null model, thus interpreting the problem as one of 
measuring the information gain. The Kullback-Leibler distance is given in terms of 
the expectation of the log-likelihood function under $\beta$, that is 
$\Gamma(\beta) = 2(I(\beta) - I(0))$ with 

$$I(\beta) = \int\!\!\int \log(f(t|x, \beta))\, dS(t|x, \beta)\, dF_X(x). \tag{11}$$ 

The $R^2$ coefficient is defined as 

$$R^2_{KO} = 1 - \exp(-\Gamma(\hat\beta)) \tag{12}$$ 

to keep the interpretation of a measure ranging between 0 (for no information 
gained) and 1 (perfect explanation of the response variable by the model). $R^2_{KO}$ is 
referred to as a coefficient of explained randomness (Kent & O’Quigley, 1988), in 
order to distinguish this information gain approach from the explained variation 
measures. Note that in the absence of censoring, a standard estimate of $\Gamma(\beta)$ would 
be the usual log likelihood ratio statistic divided by $n$, where 
$dS(t|x, \beta)\, dF_X(x)$ is replaced by the observed empirical distribution of $(X, T)$. 
In that case, $R^2_{KO}$ would be equivalent to $R^2_{LR}$ and also $R^2_{OXS}$. However, Kent and 
O’Quigley proposed a different estimate for $\Gamma(\beta)$ that is consistent also when 
independent censoring is present. The information-gain approach is very general 
and does not place any distributional assumptions on the data. However, to ease the 
computational burden, Kent and O’Quigley suggest assuming a Weibull survival 
model. Nevertheless, O’Quigley et al. (2005) showed that $R^2_{OXS}$ and $R^2_{KO}$ are 
generally close, even though $R^2_{OXS}$ does not require this assumption. 
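
The equivalence in the uncensored case can be written out in one line. If the information gain is estimated by the log likelihood ratio statistic $2(l(\hat\beta) - l(0))$ divided by $n$, then the coefficient of explained randomness reduces to the Cox and Snell form of (8); and since every observation is an event when there is no censoring, $e = n$ and the same expression also equals (10):

```latex
\hat\Gamma = \frac{2}{n}\big(l(\hat\beta) - l(0)\big)
\quad\Longrightarrow\quad
1 - \exp(-\hat\Gamma)
  = 1 - \exp\Big(-\frac{2}{n}\big(l(\hat\beta) - l(0)\big)\Big)
  = 1 - \Big(\frac{L(\hat\beta)}{L(0)}\Big)^{-2/n}.
```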

Finally, the approach by Xu and O’Quigley (1999) is introduced. It is restricted 
to proportional hazards regression models, because here the mean squared residuals 
MSE in the $R^2_{adj}$ measure for linear regression are replaced by the (weighted) 
sums of squared Schoenfeld residuals, denoted by $J(\beta)$: 

$$R^2_{XO} = 1 - \frac{J(\hat\beta)}{J(0)}. \tag{13}$$ 

Note that the use of Schoenfeld residuals implies that the “null” residuals in 
$J(\beta = 0)$ still depend on the data. $R^2_{XO}$ has a different interpretation than the other 
coefficients, because Schoenfeld residuals give the variation in $X$ given $t$ rather than 
the variation in survival times $S(t|X)$. This “reversed” approach has the advantage 
that time-dependent covariates can readily be incorporated. For independent 
censoring, the values of $R^2_{XO}$ are generally close to $R^2_{KO}$, despite the reversed 
approach, and $R^2_{XO}$ is thus also interpreted as an approximation to $R^2_{KO}$. 

Figure 1 summarises some of the relations between the $R^2$ measures introduced 
above. 

Fig. 1 Graphical representation of the inter-relations between $R^2$ measures for survival data 



4 Low-Dimensional Data: Simulation Example 

We compare the $R^2$ coefficients in a simulation example. As $R^2_{LR}$ and $R^2_N$ are very 
similar here (since $\max(R^2_{LR}) \approx 1$), only $R^2_N$ will be reported, as it is more widely 
used. Exponential survival times are simulated for $n = 1{,}000$ patients depending 
on two normally distributed covariates $X_1 \sim N(3, 3)$ and $X_2 \sim N(0, 1)$ 
according to a Cox PH model with constant baseline hazard $h_0 = \log(2)/t_{med}$, 
where $t_{med}$ is the median time to event. The maximum observation time is set to 
$t_{max} = 3$ years, and additional censoring is introduced at uniformly random time 
points. In a first scenario, the regression coefficients are set to $\beta_{X_1} = \log(4)$ and 
$\beta_{X_2} = \log(2)$ to simulate a model with large effects and large prognostic value. In a 
second scenario, the simulation is repeated with small effect sizes ($\beta_{X_1} = \log(1.2)$ 
and $\beta_{X_2} = \log(1.2)$) to generate a model with a much smaller and more realistic 
prognostic value. To illustrate the dependence of the $R^2$ measures on the amount 
of independent censoring, the simulations are performed for three scenarios: with 
no censoring (here $t_{max} = \infty$), with approximately 25% censoring and with about 
45% censoring, respectively. Different amounts of censoring are induced by varying 
the median survival time $t_{med}$. The simulations are repeated $S = 1{,}000$ times. 

Fig. 2 Boxplots of $R^2$ measures for simulations with large (left) and small (right) effect sizes 
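
The simulation design described above can be sketched in a few lines. Several details are assumptions made for this illustration, because the text does not fix them: the second parameter of $N(3, 3)$ is read as a variance, the additional censoring time is drawn uniformly on $(0, t_{max})$ for every patient, and the function name `simulate` and the seed are hypothetical.

```python
import math
import random


def simulate(n, beta1, beta2, t_med, t_max=3.0, seed=1):
    """Draw (time, event, x1, x2) tuples from an exponential Cox PH model."""
    rng = random.Random(seed)
    h0 = math.log(2.0) / t_med  # constant baseline hazard
    data = []
    for _ in range(n):
        x1 = rng.gauss(3.0, math.sqrt(3.0))  # assumption: N(3, 3) has variance 3
        x2 = rng.gauss(0.0, 1.0)
        rate = h0 * math.exp(beta1 * x1 + beta2 * x2)  # hazard of this patient
        survival_time = rng.expovariate(rate)
        censoring_time = rng.uniform(0.0, t_max)  # assumption: uniform on (0, t_max)
        time = min(survival_time, censoring_time)
        event = survival_time <= censoring_time
        data.append((time, event, x1, x2))
    return data


# Small-effects scenario with a hypothetical baseline median of one year.
sample = simulate(n=1000, beta1=math.log(1.2), beta2=math.log(1.2), t_med=1.0)
n_events = sum(1 for _, event, _, _ in sample if event)
print(f"censored: {100.0 * (len(sample) - n_events) / len(sample):.1f}%")
```

Increasing or decreasing `t_med` changes the share of censored observations, which is how the different censoring scenarios are induced in the text.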

Figure 2 shows the $R^2$ estimates as boxplots. The measures derived from 
$IBS_X(t^*)$ and $D_X(t^*)$ take systematically lower values than all other coefficients. 
In the large-effects scenario, their observed median values across all simulations are 
close to 0.7, while the median values of the model-based coefficients vary from 
ca. 0.85 to more than 0.95. In the small-effects model, the same picture emerges, 
albeit at much lower levels (median values are between 0.1 and 0.25). The 
variances of the $R^2$ values are larger in the small-effects models than in the 
large-effects models. They are largest for $R^2_{OXS}$, and they increase with 
increasing amounts of censoring. In summary, the large differences between the 
$R^2$ measures suggest that one should not attempt to interpret these measures with 
respect to some absolute scale, something that has been proposed in the literature 
(e.g., Dunkler, Michiels, & Schemper, 2007). 

We observe that the measures differ in the extent to which they are affected by 
censoring. Nagelkerke’s coefficient is most affected; we confirm the negative 
correlation with the amount of censoring reported in the literature. A second 
noticeably influenced measure is $R^2_{OXS}$, for which we observe a positive 
correlation with censoring. The Kullback-Leibler coefficient $R^2_{KO}$ does not seem to 
be affected at all, except for a slight increase in its variance in the small-effects 
model. The measures based on prediction accuracy are also influenced very little. 
$R^2_{OXS}$ and $R^2_{XO}$ are good approximations to $R^2_{KO}$ in the simulation examples 
without censoring, but underestimate $R^2_{KO}$ somewhat. With increasing amounts of 
censoring, $R^2_{OXS}$ in particular starts to increase and, in the large-effects model, 
starts to deviate from $R^2_{KO}$. 



5 High-Dimensional Data: Lymphoma Application 

The lymphoma survival microarray study by Rosenwald, Wright, Chan, Connors, 
Campo, et al. (2002) includes gene expression data for 7,399 genes. We analyze 
222 patients, who were assessed according to a standard clinical risk factor, the 
International Prognostic Index (IPI). The median follow-up time was 7.8 years. 
During that time, 127 patients (57%) died, with a median time to death of 4.1 years. 
This is a typical example, in that the objective is to find a prognostic model built 
from gene expression data to predict overall survival, accounting for already known 
clinical factors (in this case IPI). To improve interpretability, we are interested in 
finding sparse gene signatures, that is, prognostic models containing only a small 
number of genes. A suitable method is LASSO penalised Cox proportional hazards 
(PH) regression (Tibshirani, 1996). It combines parameter estimation and variable 
selection by optimising the penalised likelihood $l(\beta^*) - \lambda \sum_j |\beta^G_j|$, where 
$\beta^* = (\beta^G, \beta^{IPI})$ combines the coefficient for IPI and the gene coefficient vector 
$\beta^G$. The penalty parameter $\lambda$ is optimised via five-fold cross-validation. The model 
built in this way will be called the full model, while the Cox PH model with IPI as 
the only covariate will be referred to as the clinical model. 
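
To show what is being optimised, the sketch below evaluates a penalised Cox partial log-likelihood for given coefficients. It assumes that there are no tied event times and that only the gene coefficients are penalised while the IPI coefficient is left unpenalised; the function name, the argument layout and the toy data are hypothetical, and the optimisation itself (including the cross-validation of the penalty parameter) is not shown.

```python
import math


def penalised_partial_loglik(times, events, ipi, genes, beta_ipi, beta_genes, lam):
    """Cox partial log-likelihood minus lam times the L1 norm of the gene coefficients.

    Assumes no tied event times; the IPI coefficient is not penalised.
    """
    linear_predictor = [
        beta_ipi * z + sum(b * g for b, g in zip(beta_genes, row))
        for z, row in zip(ipi, genes)
    ]
    loglik = 0.0
    for i, (t_i, is_event) in enumerate(zip(times, events)):
        if not is_event:
            continue  # censored observations only enter through the risk sets
        risk_set = sum(
            math.exp(linear_predictor[j]) for j, t_j in enumerate(times) if t_j >= t_i
        )
        loglik += linear_predictor[i] - math.log(risk_set)
    penalty = lam * sum(abs(b) for b in beta_genes)
    return loglik - penalty


# Toy data: three patients, two genes, the last patient is censored.
times = [1.0, 2.0, 3.0]
events = [True, True, False]
ipi = [2.0, 1.0, 0.0]
genes = [[0.5, -1.0], [0.1, 0.3], [-0.4, 0.8]]

null_value = penalised_partial_loglik(times, events, ipi, genes, 0.0, [0.0, 0.0], lam=2.0)
unpenalised = penalised_partial_loglik(times, events, ipi, genes, 0.3, [0.5, -0.5], lam=0.0)
penalised = penalised_partial_loglik(times, events, ipi, genes, 0.3, [0.5, -0.5], lam=2.0)
print(null_value, unpenalised, penalised)
```

With all coefficients equal to zero the value is $-\log 3 - \log 2$, and for fixed coefficients the penalty lowers the criterion by exactly `lam` times the sum of the absolute gene coefficients.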

Applying the prediction-accuracy-based measures to penalised models with 
$p \gg n$ data is straightforward. Note, however, that $D_X(t^*)$ assumes a correctly 
specified model, while in fact the penalty introduces a downward bias in the 
coefficient estimates, introducing model misspecification. In addition, for 
high-dimensional data, the Cox PH model assumptions cannot realistically be 
assessed for all $p$ covariates, and some covariates are likely to violate them. The 
same is true for the assumptions of a Weibull model that are made when computing 
$R^2_{KO}$ (Kent & O’Quigley, 1988). For the likelihood-based coefficients $R^2_{LR}$, $R^2_N$ 
and $R^2_{OXS}$, we keep the definitions from Sect. 3, that is, the unpenalised 
log-likelihoods are used to summarise the models. A possible alternative would be 
to utilise the penalised log-likelihoods, and we will make these modified measures 
a subject of future investigations. $R^2_{XO}$ is straightforward to adapt to penalised 
models, by replacing the maximum-likelihood estimates in the computation of the 
Schoenfeld residuals by the maximum-penalised-likelihood estimates. 

Following Binder and Schumacher (2008), we applied the .632 bootstrap 
method without replacement for $R^2$ estimation. The $R^2$ estimates (summarised as 
boxplots in Fig. 3) lead to conclusions similar to those obtained for the 
low-dimensional simulations in Sect. 4. The values of the coefficients based on 
prediction accuracy are again lower than those of the other measures. The $R^2_{KO}$ 
estimates are close to both $R^2_{OXS}$ and $R^2_{XO}$ despite the assumption of a Weibull 
distribution. Nagelkerke’s $R^2_N$ is much smaller than the other model-based 
coefficients, no doubt because of the large amount of censoring present in the data. 
In summary, one can again not interpret the $R^2$ values on an absolute scale. For 
example, if one were to use the scale proposed for gene expression applications by 
Dunkler et al. (2007), i.e., designate all $R^2$ values below 20% as describing a “weak” 
prognostic value, between 20% and 39% as “medium”, between 40% and 59% as 
“strong” and all values of 60% or more as “very strong”, then the full model would 
be just barely above the “weak” effect mark according to some of the measures, but 
would be declared to have a strong effect based on $R^2_{OXS}$. 
All measures indicate a clear improvement in the prognostic value when comparing 
the full model to the clinical model: the inter-quartile ranges of the bootstrapped 
$R^2$ coefficients of both models do not overlap for any of the measures. Hence, 
independently of the choice of $R^2$ measure, one concludes that including the gene 
expression signature does improve the prognostic information of the survival model 
over the IPI-only model. Note that the variability in the bootstrap samples differs 
greatly between the measures. The variability is lowest for $R^2_{LR}$ and $R^2_{KO}$ and 
largest for $R^2_{OXS}$. The results of the small-effects simulation in Sect. 4 suggest that 
the latter might be due to the large amount of censoring in the lymphoma data set. 

Fig. 3 Boxplots of 0.632-bootstrapped $R^2$ values for the clinical model and full model in the 
lymphoma application 
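
The absolute scale of Dunkler et al. (2007) quoted above is easy to state as a lookup, which also makes the problem with it visible. The sketch below assumes half-open intervals for values that are not whole percentages; the function name and the two example values are hypothetical and are not taken from Fig. 3.

```python
def prognostic_category(r2_percent: float) -> str:
    """Map an R^2 value in percent to the verbal scale quoted in the text."""
    if r2_percent < 20:
        return "weak"
    if r2_percent < 40:
        return "medium"
    if r2_percent < 60:
        return "strong"
    return "very strong"


# Two hypothetical values that different measures might assign to one model.
for label, value in [("measure A", 22.0), ("measure B", 45.0)]:
    print(label, value, prognostic_category(value))
```

The same model would be labelled “medium” by one measure and “strong” by the other, which illustrates why a single absolute scale cannot be applied across measures.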



6 Conclusions 

It was the aim of this paper to present existing approaches to measure the 
prognostic value of survival models in a common framework and to outline 
connections between them. Measures are either based on prediction accuracy or are 
linked to the model likelihood. The latter are dependent on the chosen model itself, 
which implies that these coefficients can only be computed for models with an 
existing likelihood, while the prediction-focussed measures are available for any 
prognostic system that makes predictions about survival probabilities, including 
expert systems devised ad hoc without statistical modelling. This has been pointed 
out by Graf et al. (1999) as an advantage of their $R^2_{IBS}$ coefficient. In our 
simulations and in the lymphoma example, we observed that the $R^2$ coefficients 
based on prediction accuracy clearly take smaller values than the model-based 
measures. This leads us to conclude that using absolute reference values for 
interpreting the observed prognostic value is not to be advised in general without 
taking into account which $R^2$ method is being used and how the data are modelled. 

The question remains which $R^2$ measure is to be preferred in a given situation. 
This depends on the model framework within which the prognostic values of 
models are to be compared. If one intends to stay within the framework of Cox 
regression, then $R^2_{XO}$ might be the coefficient of choice, not least because it allows 
for the incorporation of time-dependent covariates. However, to allow for 
comparisons with ad hoc prognostic systems not based on statistical modelling, one 
has to resort to prediction-accuracy-based coefficients. When the observed survival 
times are censored, those measures that are least affected by censoring should be 
used. In that respect, of all model-based measures, $R^2_{KO}$ is to be preferred for 
censored data. Nagelkerke’s coefficient, which is the standard output of the R 
survival analysis functions coxph and cph, depends most heavily on the censoring 
mechanism and should thus be avoided. Note that in any case censoring can only be 
accounted for if it is (conditionally) independent of the failure mechanism. 

We conclude by noting that in the high-dimensional lymphoma application of 
penalised Cox PH regression, the observed patterns between the $R^2$ coefficients 
are similar to those in the low-dimensional simulation examples with unpenalised 
Cox models. Care has to be taken when interpreting the likelihood-based $R^2$ 
coefficients in the context of penalised models, as some properties of the measures 
do not carry over. For example, the coefficients do not take their maximum at 
$\hat\beta$ if $\hat\beta$ is the maximum-penalised-likelihood solution rather than the 
maximum-likelihood estimate. 
Quality-Based Clustering of Functional Data: 
Applications to Time Course Microarray Data 



Theresa Scharl and Friedrich Leisch 



Abstract Cluster methods are typically applied to time course gene expression data 
to find co-regulated genes which can finally help to reveal pathways and interactions 
between genes. Clustering is either carried out on the raw data or on functional data. 
In functional data analysis a curve is fit to each observation in order to account for 
time dependency. As gene expression over time is biologically a continuous process 
it can be represented by a continuous function. The different curve shapes found in 
a dataset can have important interpretations and characteristic patterns can be found 
by clustering the estimated regression coefficients. 

In this simulation study on artificial data the well-known K-Means algorithm as 
well as the quality-based cluster algorithm QT-Clust are applied to both the raw data 
as well as functional data. The performance of the different methods is evaluated 
when different types of noise are added to the data. All cluster algorithms used are 
implemented in R. 

Keywords Cluster analysis • Functional data • Time course gene expression 
data • R. 



1 Introduction 



Clustering is frequently applied in the analysis of time course gene expression 
microarray data for the first investigation of the data before focussing on special sub- 
groups of interest. As functionally related genes are likely to be co-expressed (e.g., 
Eisen, Spellman, Brown, & Botstein, 1998) clustering groups of genes with simi- 
lar expression pattern can help to reveal the function of previously uncharacterized 
genes. Finally gene clusters can help to reveal pathways and interactions between 
genes. 
In the literature many cluster methods have been proposed and applied to microarray 
data. Many detailed reviews of currently used methods and the main challenges 
with gene expression data are available (e.g., Androulakis, Yang, & Almon, 2007; 
Kerr, Ruskin, Crane, & Doolan, 2008; Sheng, Moreau, Smet, Marchal, & Moor, 
2005). Classical algorithms like K-Means, partitioning around medoids or self- 
organizing maps are used as well as model-based clustering (Fraley & Raftery, 
1998) or quality-based clustering (Heyer, Kruglyak, & Yooseph, 1999; Smet et al., 
2002). Clustering is either carried out on the raw data or on functional data. In 
functional data analysis (Ramsay & Silverman, 1997) a curve is fit to each observation in 
order to account for time dependency. As gene expression over time is biologically 
a continuous process it can be represented by a continuous function. The different 
curve shapes found in a dataset can have important interpretations and characteristic 
patterns can be found by clustering the estimated regression coefficients. Functional 
data is commonly clustered using the K-Means algorithm (Abraham, Cornillon, 
Matzner-Lober, & Molinari, 2003; de Hoon, Imoto, & Miyano, 2002; Hakamada, 
Okamoto, & Hanai, 2006; Serban & Wasserman, 2005; Tarpey, 2003, 2007). 

Many different cluster methods have been applied to microarray data. However, 
comprehensive comparative studies of gene clustering methods are rare. One 
example of such a study is given in Thalamuthu, Mukhopadhyay, Zheng, and Tseng 
(2006) where gene clustering methods are compared using both artificial and real 
microarray data. In their paper gene expression over time was modeled by piece- 
wise constant functions. In this paper gene expression over time is represented 
by a continuous function and therefore clustering functional data is useful. As the 
comparison of cluster methods depends very much on the specific use of the 
methods as well as on the data itself, cluster methods should be evaluated rather than 
compared (Androulakis, Yang, & Almon, 2007). For this reason an extensive 
investigation of cluster methods on artificial datasets is performed in this simulation 
study, where the true cluster memberships are known. 

The quality-based cluster algorithm Stochastic QT-Clust (Leisch, 2006; Scharl 
& Leisch, 2006) is compared to the K-Means algorithm using both raw data and 
functional data. The goal of the study is to investigate if quality-based clustering of 
the estimated regression coefficients yields more reliable results than clustering the 
raw data. 

Microarray data are typically very noisy, as technical artefacts can easily 
distort the measurements. Additionally, large sets of genes are usually unaffected 
by the experiment and do not show differential expression over time. In this 
simulation study the performance of the different cluster approaches is evaluated 
on artificial datasets to which different types of noise are added. 

All cluster algorithms used as well as the artificial data generator are 
implemented in R (R Development Core Team, 2009, http://www.R-project.org). 
R package flexclust (Leisch, 2006) contains extensible implementations of the 
k-centroids and QT-Clust algorithm (Heyer, Kruglyak, & Yooseph, 1999; Leisch, 
2006; Scharl & Leisch, 2006). Cluster algorithms are treated separately from 
distance measures, and new distance measures and centroid computations can easily 
be incorporated into cluster procedures. R package gcExplorer (Scharl & 
Leisch, 2008) contains functionality for the graphical exploration of gene 
clusters and an artificial time course microarray data generator. The latest 
release of gcExplorer is always available at the Comprehensive R Archive Network 
CRAN: http://cran.R-project.org/package=gcExplorer. 



2 Methods 

2.1 K-Means Clustering of Functional Data 

The standard application of K-Means is to assign data points to clusters based on 
minimal Euclidean distance to the cluster centers. However, observations over time 
are not just ordinary points in Euclidean space but curves with distinct shapes. 
Clustering functional data using the K-Means algorithm (Tarpey, 2003, 2007) is 
very useful to determine representative curve shapes in a functional dataset. This 
approach is frequently used in clustering microarray data (e.g., Abraham, Cornillon, 
Matzner-Lober, & Molinari, 2003; de Hoon, Imoto, & Miyano, 2002; Hakamada, 
Okamoto, & Hanai, 2006; Serban & Wasserman, 2005) where different methods are 
used to fit curves to the data. 

In this paper a cubic spline using a B-spline basis is fit to each gene expres- 
sion profile and the estimated regression coefficients are plugged into the K-Means 
algorithm. 
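The B-spline basis mentioned above can be illustrated with a short sketch. It only builds the design matrix of a cubic B-spline basis with the Cox-de Boor recursion; the regression coefficients that are passed to K-Means would then come from a least-squares fit of each expression profile on this matrix (not shown). The knot vector, the rescaling of 16 time points to [0, 1) and the function name `bspline_basis` are assumptions made for illustration and are not taken from the paper.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    # Degree 0: indicator functions of the half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            # Terms with a zero-width knot span are defined as zero.
            if knots[i + d] != knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * basis[i]
            if knots[i + d + 1] != knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * basis[i + 1])
            new.append(left + right)
        basis = new
    return basis


# Hypothetical setup: cubic basis, clamped knots, 16 time points in [0, 1).
knots = [0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]
time_points = [j / 16 for j in range(16)]
design = [bspline_basis(t, knots, 3) for t in time_points]
```

Each row of `design` holds the values of the seven basis functions at one time point; the rows sum to one, which is a quick sanity check for the recursion.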



2.2 Quality-Based Clustering of Functional Data 

The focus of quality-based clustering (Heyer, Kruglyak, & Yooseph, 1999) is to 
find large clusters of a certain quality, i.e., to find clusters whose diameter does not 
exceed a given threshold value. So quality-based clustering overcomes some of the 
problems of classical cluster algorithms like K-Means, e.g., the predefinition of the 
number of clusters. Additionally data points are not forced into a cluster when the 
similarity to other cluster members is low. 
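The idea of growing clusters under a diameter threshold can be sketched as follows. This is a simplified, deterministic variant in the spirit of Heyer et al. (1999), written for illustration only: it is not the stochastic QT-Clust implementation in flexclust, and the function names, the Euclidean distance and the `min_size` stopping rule are assumptions.

```python
import math


def diameter_with(cluster, candidate, points):
    """Largest distance between the candidate and the current members."""
    return max(math.dist(points[candidate], points[m]) for m in cluster)


def qt_clust(points, max_diameter, min_size=2):
    """Greedy quality-threshold clustering; returns (clusters, unassigned)."""
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        best = []
        for seed in sorted(remaining):
            cluster, diam = [seed], 0.0
            candidates = remaining - {seed}
            while candidates:
                # Add the point that keeps the cluster diameter smallest.
                nxt = min(sorted(candidates),
                          key=lambda c: max(diam, diameter_with(cluster, c, points)))
                new_diam = max(diam, diameter_with(cluster, nxt, points))
                if new_diam > max_diameter:
                    break  # any further point would violate the threshold
                cluster.append(nxt)
                diam = new_diam
                candidates.remove(nxt)
            if len(cluster) > len(best):
                best = cluster
        if len(best) < min_size:
            break  # only isolated points are left; they stay unassigned
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters, sorted(remaining)


points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (20, 20)]
clusters, unassigned = qt_clust(points, max_diameter=0.5)
```

In this toy example the number of clusters is not fixed in advance, and the isolated point is left unassigned rather than forced into a cluster.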

Here quality-based clustering is not only applied to the raw data, i.e., the gene 
expression value at several time points. Additionally each expression profile is mod- 
eled as a cubic spline using a B-spline basis in order to account for time dependency 
and the cluster algorithm is performed on the parameters. 



3 Simulation Design 

The framework of this simulation study is the following: 

1. Generate 100 sets of cluster centers 

2. Add different types of noise to the cluster centers 






T. Scharl and F. Leisch 



3. Perform cluster analysis using K-Means or QT-Clust: 

• Either on the raw data 

• Or represent each gene expression pattern by a curve and perform the cluster 
algorithm on the parameters 

4. Evaluate the performance of different cluster methods on the datasets where the 
noise set of genes is omitted using the adjusted Rand index (Hubert & Arabie, 
1985) 
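The adjusted Rand index used in step 4 can be computed from the contingency table of two partitions. The following is a minimal sketch of the Hubert and Arabie (1985) formula; the function name and the toy label vectors are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same objects."""
    n = len(labels_a)
    # Pairs of objects placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every object is its own cluster
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # relabeled copy
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # fully crossed
```

A value of 1 means that the two partitions agree up to relabeling; values near 0 correspond to agreement expected by chance, and negative values are possible.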



3.1 Integrated AR Processes for Simulated Data 

The cluster centers are created using integrated autoregressive models because 
integrated AR(1) processes resemble the shape of gene expression over time observed 
in real time course data very well. An autoregressive process $X_t$ of order 1 is 
defined by 

$$X_t = \alpha X_{t-1} + \epsilon_t,$$

where $\epsilon_t$ is a series of uncorrelated random variables with mean 0 and 
variance $\sigma^2$. It describes how each observation is a function of the 
previous observation. 

An integrated AR(1) process is a process whose $d$th difference is an AR(1) 
process. If $d = 0$ the observations are modeled directly, if $d = 1$ the 
differences between consecutive observations are modeled, i.e., 

$$X_t = X_{t-1} + \alpha (X_{t-1} - X_{t-2}) + \epsilon_t.$$

If $d = 2$ the differences of differences are modeled, etc. 
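The definition can be made concrete with a small sketch: an AR(1) series is simulated and then summed cumulatively d times, so that its d-th difference is again the AR(1) series. This is not the generator from gcExplorer; the function names, the starting value of zero and the parameter values are assumptions for illustration.

```python
import random


def ar1(n, alpha, sigma, rng):
    """Simulate n values of X_t = alpha * X_{t-1} + eps_t, started at zero."""
    series, prev = [], 0.0
    for _ in range(n):
        prev = alpha * prev + rng.gauss(0.0, sigma)
        series.append(prev)
    return series


def integrate(series, d):
    """Cumulatively sum the series d times (integration of order d)."""
    for _ in range(d):
        total, out = 0.0, []
        for value in series:
            total += value
            out.append(total)
        series = out
    return series


def difference(series):
    """First difference, keeping the first value so integrate() is inverted."""
    return [series[0]] + [b - a for a, b in zip(series, series[1:])]


rng = random.Random(1)
base = ar1(16, alpha=0.5, sigma=1.0, rng=rng)   # 16 time points
center = integrate(base, d=2)                    # smoother curve for d = 2
recovered = difference(difference(center))       # equals base up to rounding
```

Larger d gives smoother curves because the noise is accumulated rather than added directly to each observation.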

In the artificial data generator available in R package gcExplorer parameter d 
is either 1 or 2 in order to get different degrees of smoothness. Half of the generated 
time series are then reversed and finally transformed to the range of typical gene 
expression profiles. 

One set of cluster centers consists of 15 expression patterns yielding datasets of 
15 clusters (as used in Thalamuthu et al., 2006) with dimension (number of time 
points) 16. 

The datasets are then created by adding noise to the cluster centers. The 
expression pattern $y$ of each gene $i$ in a given cluster $k$ is assumed to follow 
the shape of the cluster center $\mu_k$ but with a gene specific shift $b_i$ 
(specified by the noise parameter “SD of random intercept”). Additionally a 
normally distributed measurement error $\epsilon_{ij}$ (specified by the noise 
parameter “SD of mean of genes”) is added to each observation (time point) $j$: 

$$y_{ij} = \mu_k(t_j) + b_i + \epsilon_{ij},$$

where $b_i \sim N(0, \sigma_b^2)$ and $\epsilon_{ij} \sim N(0, \sigma^2)$. 
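As a hedged illustration of this data model, the sketch below draws one gene profile from a given cluster center: a single random intercept per gene plus independent measurement error per time point. The function name, the example center and the parameter values are hypothetical.

```python
import random


def simulate_gene(center, sd_intercept, sd_error, rng):
    """Draw one expression profile around a cluster center."""
    shift = rng.gauss(0.0, sd_intercept)  # gene specific shift, drawn once
    # Independent measurement error is added at every time point.
    return [mu + shift + rng.gauss(0.0, sd_error) for mu in center]


rng = random.Random(42)
center = [0.0, 0.5, 1.0, 0.5, 0.0]                  # toy cluster center
exact = simulate_gene(center, 0.0, 0.0, rng)        # no noise at all
shifted = simulate_gene(center, 0.7, 0.0, rng)      # shift only
offsets = [y - mu for y, mu in zip(shifted, center)]
```

With the measurement error switched off, the simulated profile is the cluster center moved up or down by a constant, which is the situation where clustering curve shapes is expected to help.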






Table 1 Overview of the varying noise parameters 

                                             Noise level 
Symbol             Noise parameter             Low    Medium   High 
$\epsilon_{ij}$    SD of mean of genes         0.1    0.3      0.5 
$N$                Number of noise genes       100    500      1,000 
$\sigma_m$         SD of mean of noise genes   0      1        2 
$b_i$              SD of random intercept      0.1    0.7      1.5 



As typical gene clusters have arbitrary sizes, all simulated datasets consist of 
clusters of sizes between 10 and 100 (three clusters each of sizes 10, 20, 30, 50 
and 100), yielding a total of 630 genes with defined cluster pattern. Finally an 
additional noise set of genes of specified size $N$ (given by the noise parameter 
“number of noise genes”) is added to the data. Each noise gene is generated as 
$n_i \sim N(m_i, \sigma_n^2)$ where $m_i \sim N(0, \sigma_m^2)$ and 
$\sigma_n \sim U(0.1, 0.3)$. $\sigma_m$ is specified by the noise parameter 
“SD of mean of noise genes”. 

An overview of the different noise parameters is given in Table 1. One set of 
cluster centers is used to generate 81 datasets using all possible combinations of 
noise parameters. 
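The counts quoted in this section can be checked with a few lines: four noise parameters with three levels each give 3^4 = 81 settings, and three clusters of each of the five sizes give 630 genes. The dictionary keys below are illustrative names, not identifiers from gcExplorer; the values are those of Table 1.

```python
from itertools import product

# Levels of the four noise parameters (low, medium, high) as in Table 1.
noise_levels = {
    "sd_mean_of_genes": [0.1, 0.3, 0.5],
    "number_of_noise_genes": [100, 500, 1000],
    "sd_mean_of_noise_genes": [0, 1, 2],
    "sd_random_intercept": [0.1, 0.7, 1.5],
}

# Every combination of levels defines one dataset per set of cluster centers.
settings = [dict(zip(noise_levels, combo))
            for combo in product(*noise_levels.values())]

# Three clusters of each size.
cluster_sizes = [10, 20, 30, 50, 100] * 3

print(len(settings), len(cluster_sizes), sum(cluster_sizes))
```

With 100 sets of cluster centers this design yields 8,100 datasets in total.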

An example of such a dataset is given in Fig. 1. The dataset consists of 15 clusters 
with high SD of mean of genes, i.e., large deviation from the cluster centers and a 
noise set of genes located around 0. 



4 Simulation Results 

The adjusted Rand index between the true cluster memberships and the cluster 
solutions is compared for the different cluster methods for several noise levels. 
Figure 2 shows boxplots of the adjusted Rand index of 100 artificial datasets when 
low, medium and high noise (as defined in Table 1) is added to the cluster centers. 
In the bottom panel of Fig. 2 the cluster methods are compared when the low noise 
level is added to the datasets. Here QT-Clust outperforms K-Means on both the raw 
data and the functional data. For the low noise level all cluster methods perform 
well, and clustering functional data slightly outperforms clustering the raw data. 
In the middle panel the medium noise level is added to the datasets. In this case 
K-Means on functional data performs best, followed by QT-Clust on functional data. 
Again clustering the functional data outperforms clustering the raw data for both 
cluster algorithms. In the top panel, where the high noise level was added to the 
datasets, the trend of the medium noise level continues. However, the performance 
of all cluster methods is very poor. 

Finally the performance of the cluster methods is shown when only one type of 
noise is present in the data (Fig. 3). In the top panel of Fig. 3 the adjusted Rand 
index of datasets with a large SD of mean of genes from their cluster centers is 
shown. In this case K-Means on the raw data clearly outperformed all other methods, 
followed by QT-Clust on raw data. Both K-Means and QT-Clust on raw data show 
a high variability in their performance. This is due to the fact that such a large 
SD of the mean of genes results in clusters with no clear expression pattern, and 
cluster results are almost arbitrary. However, this is the only setting where 
clustering the functional data is obviously not the method of choice. In the second 
panel the adjusted Rand index of 100 datasets is plotted when a noise set of 1,000 
genes is added to a dataset with low noise level in the other three noise 
parameters. In this case the performance of all four cluster methods is good, so 
the number of noise genes does not affect their performance much. Almost the same 
is true in the case of a large SD of the mean of the noise genes (third panel). 
However, the high variability in the performance of QT-Clust on raw data is 
striking. This is probably due to technical details of the algorithm: clusters are 
created by randomly picking cluster centers and adding as many genes as possible 
without surpassing the quality threshold. Picking noise genes as cluster centers 
might therefore lead to noisy cluster results. Finally, in the case of a large SD 
of the random intercept, i.e., the gene specific shift (bottom panel), clustering 
functional data using both K-Means and QT-Clust yields much better results than 
clustering the raw data. This was expected, as genes with similar expression 
patterns but on different expression levels can more easily be found by clustering 
the estimated regression coefficients. 

Fig. 1 Example of an artificial dataset: 15 clusters with large deviation from the cluster centers 
and a noise set of genes located around 0 

Fig. 2 Adjusted Rand index on 100 datasets with low, medium and high noise level added to the 
cluster centers 



5 Summary 

In this paper a simulation study on artificial data was presented evaluating the 
performance of different cluster methods. This work should give practitioners an 
idea of what they can expect from clustering time course microarray data. 
Artificial data was used where different types of noise were added to the data and 
where the true cluster memberships are known. 

Fig. 3 Adjusted Rand index on 100 datasets when only one type of noise is present in the data 



The simulation study showed that clustering of functional data outperforms 
clustering of raw data in most scenarios. When the noise level in the data is low, 
QT-Clust yields better results than K-Means on both types of data. On the other 
hand, K-Means is more robust to noisy datasets. 

It was shown that large noise sets of genes (i.e., genes that do not play a role in 
the experiment under investigation) do not affect the cluster result much. All four 
cluster methods showed a large agreement between the cluster solutions and the true 
cluster membership even when a large number of noise genes was present in the 
data. This indicates that different ways of filtering the data usually do not yield 
different results. 

On the other hand, noise that affects the expression pattern of genes or the gene 
specific shift has a large impact on the cluster solution. If only a gene specific 
shift is expected in the data, clustering the functional data should be performed, 
as genes shifted along the y-axis still have similar estimated regression 
coefficients. If the gene expression patterns are noisy, clustering the raw data is 
preferable. 

In the future further cluster methods, especially model-based clustering, will be 
included in the simulation study in order to find the most appropriate method for 
each setup. Finally, the different cluster methods will be evaluated on a 
real-world dataset from E. coli, including biological knowledge of the organism as 
well as findings from previous experiments. 

ceutical Technology (ACBT). 



References 



Abraham, C., Cornillon, P.-A., Matzner-Lober, E., & Molinari, N. (2003). Unsupervised curve 
clustering using B-splines. Scandinavian Journal of Statistics, 30(3), 581-595. 

Androulakis, I., Yang, E., & Almon, R. (2007). Analysis of time-series gene expression data: 
Methods, challenges, and opportunities. Annual Review of Biomedical Engineering, 9, 205-228. 

de Hoon, M. J. L., Imoto, S., & Miyano, S. (2002). Statistical analysis of a small set of time-ordered 
gene expression data using linear splines. Bioinformatics, 18(11), 1477-1485. 

Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display 
of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the 
United States of America, 95, 14863-14868. 

Fraley, C., & Raftery, A. (1998). How many clusters? Which clustering method? Answers via 
model-based cluster analysis. The Computer Journal, 41(8), 578-588. 

Hakamada, K., Okamoto, M., & Hanai, T. (2006). Novel technique for preprocessing high 
dimensional time-course data from DNA microarray: mathematical model-based clustering. 
Bioinformatics, 22(7), 843-848. 

Heyer, L. J., Kruglyak, S., & Yooseph, S. (1999). Exploring expression data: identification and 
analysis of coexpressed genes. Genome Research, 9, 1106-1115. 

Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218. 

Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene expression 
data. Computers in Biology and Medicine, 38(3), 283-293. 

Leisch, F. (2006). A toolbox for k-centroids cluster analysis. Computational Statistics and Data 
Analysis, 51(2), 526-544. 

R Development Core Team. (2009). R: A language and environment for statistical computing. 
Vienna, Austria (ISBN: 3-900051-07-0). 

Ramsay, J. O., & Silverman, B. W. (1997). Functional data analysis. New York: Springer. (ISBN 
0-387-94956-9). 

Scharl, T., & Leisch, F. (2006). The stochastic qt-clust algorithm: evaluation of stability and 
variance on time-course microarray data. In A. Rizzi & M. Vichi (Eds.), Compstat 2006 - 
proceedings in computational statistics (pp. 1015-1022). Heidelberg: Physica. 

Scharl, T., & Leisch, F. (2008). Using neighborhood graphs for the investigation of E. coli gene 
clusters. In M. Ahdesmaki et al. (Eds.), Proceedings of the 5th international workshop on 
computational systems biology, WCSB 2008 (June 11-13, 2008, Leipzig, Germany) (pp. 157-160). 
Tampere, Finland: Tampere University of Technology. 

Serban, N., & Wasserman, L. (2005). Cats: Clustering after transformation and smoothing. Journal 
of the American Statistical Association, 100(471), 990-999. 

Sheng, Q., Moreau, Y., Smet, F. D., Marchal, K., & Moor, B. D. (2005). Advances in cluster 
analysis of microarray data. In F. Azuaje & J. Dopazo (Eds.), Data analysis and visualization 
in genomics and proteomics. New York: Wiley (ISBN 0-470-09439-7). 






Smet, F. D., Mathys, J., Marchal, K., Thijs, G., Moor, B. D., & Moreau, Y. (2002). Adaptive 
quality-based clustering of gene expression profiles. Bioinformatics, 18(5), 735-746. 

Tarpey, T. (2003). Clustering functional data. Journal of Classification, 20, 93-114. 

Tarpey, T. (2007). Linear transformations and the k-means clustering algorithm: Applications to 
clustering curves. The American Statistician, 61, 34-40. 

Thalamuthu, A., Mukhopadhyay, I., Zheng, X., & Tseng, G. C. (2006). Evaluation and comparison 
of gene clustering methods in microarray analysis. Bioinformatics, 22(19), 2405-2412. 



A Comparison of Algorithms to Find 
Differentially Expressed Genes 
in Microarray Data 



Alfred Ultsch, Christian Pallasch, Eckhard Bergmann, 
and Holger Christiansen 



Abstract There are several different algorithms published for the identification of 
differentially expressed genes in DNA microarray experiments. Such algorithms 
produce ordered lists of genes. To compare the performance of these algorithms, 
established measurements from Information Retrieval are proposed. A benchmark 
data set with known properties is generated and published. This benchmark data 
is used to compare the performance of different algorithms with a new algorithm, 
called PUL. Surprisingly, a clear ordering in performance of the algorithms 
was observed. PUL outperformed other algorithms by a factor of two. PUL was 
applied successfully in different practical applications. For these experiments the 
importance of the genes identified by PUL was independently verified. 

Keywords Differentially expressed genes • DNA microarray data. 



1 Introduction 

In the experiments considered here the aim is to identify in DNA microarray data 
those genes that are most relevant for the distinction between two different 
populations (groups). Typically a very large number of genes is measured on a 
microarray. However, only a small percentage of the measured genes is differentially 
expressed with respect to the two groups. Algorithms for the detection of 
differentially expressed genes usually produce an ordered list of genes. First on 
this gene list should be the most differentially expressed genes. Those genes should 
account for the differences in the two groups. A very hard problem is where to cut 
this list such that the rest of the list (= genes) is not important for the group 
distinction. Since the number of measurements is large, the problem of false 
positive identification of genes is substantial. In order to compare the performance 
of different algorithms we propose here to use methods from Information Retrieval 
(IR). The identification of differentially expressed genes can be compared to a 
query to a search engine such as Google® for relevant web pages. The aim of such a 
query is to find as many relevant documents as possible. In IR this is measured as 
Recall. Furthermore, one does not want to be overwhelmed by information that is 
irrelevant to the given query. In IR this is measured as Precision. A retrieval 
algorithm is perfect if it delivers 100% in Precision and Recall. The advantage of 
this approach is that the performance of the algorithms can be compared for all 
lengths of the gene lists. The difficult problem of deciding when differences among 
populations are irrelevant is thus avoided. 

The comparison of different algorithms is performed using a benchmark data set 
with known properties. Section 2 describes the benchmark data set. In Sect. 3 the 
main characteristics of the published algorithms for differentially expressed genes 
are given. Section 4 introduces the new method called PUL. In Sect. 5 the IR meth- 
ods for the comparison of the results are introduced. Section 6 presents the results. 
Discussion and summary round off the paper. 



2 Benchmark Data Set 

In order to compare different methods for the identification of significant genes, a 
data set with known properties is used. The starting point for the benchmark data is 
a measured data set from an experiment on neuroblastoma (Berwanger et al., 2002). 
There are 18 microarrays from patients with tumor stage 1 and 15 microarrays from 
patients with tumor stage 4. The genes were ordered by decreasing difference in 
population means. From the top 46 genes (0.1% of all genes) of this list, eight 
genes with a positive difference between the groups and eight with a negative 
difference were randomly drawn. This set of 16 definitive DE genes was replicated 
five times. For each replication, from 10% to 50% of the values were replaced by 
measurements drawn randomly from genes with no difference in expression between 
the two groups. This gave a set of 6 × 16 = 96 genes which have a decreasing 
level of difference in over- or under-expression in the two groups. A set of 3,738 
unexpressed genes was randomly drawn from the data such that the gene expression 
showed no significant differences between stage 1 and stage 4. This was tested 
using array-wise permutations of the data. A set of 645 genes was randomly drawn 
from a list ordered according to the difference in means. For this, only genes with 
ranks between 200 and 1,000 on this list were used. This ensures that the benchmark 
data also contains genes with borderline DE. Technical details of the construction 
of the benchmark data set, called NBD, are given in Ultsch (2007). In summary, the 
benchmark data set NBD consists of the following types of gene expression data 
(2 classes, 18 arrays in population 1, 15 arrays in population 2): 



3,738    Unexpressed genes 
  645    Borderline genes 
   96    Differentially expressed genes 
4,479    Genes in total 

Of the differentially expressed genes 48 are overexpressed in population 1 and 48 
underexpressed in population 1. The data set can be obtained as supplementary 
material from our web page or by e-mail request. 



3 Popular Algorithms to Identify Differentially 
Expressed Genes 

Several different approaches have been published to find differences in the 
expression of genes on DNA microarrays for two different groups. The simplest 
approach is to apply a z-transformation to each array's measurements and calculate 
the two-sample t-statistic (ZT). Details can be found, for example, in Dudoit, 
Fridlyand, and Speed (2000). Significance Analysis of Microarrays (SAM) uses a 
modified t-like statistic (Tusher, Tibshirani, & Chu, 2001). Version 1.25 of SAM as 
R package (samr) was used here. Pattern Analysis of Microarrays (PAM) is described 
in Beckers, Herrmann, Rieger, Drobyshev, Horsch, et al. (2005). The PAM developers 
claim that the PAM method to identify differentially expressed (DE) genes gives 
more consistent results than SAM. The PAM software was obtained in September 2007 
from http://www.helmholtz-muenchen.de/en/ieg/group-gene-regulation/technical-aspects/natural-variability/index.html. 
Empirical Bayes methods, using the log posterior odds ratio that a gene is DE vs. 
not DE, produce the so-called B-statistics (Lönnstedt & Speed, 2001). The 
procedures for B-statistics that are implemented in the limma R package version 
2.9.17 using R version 2.4.1 for Windows (Smyth, 2004) were used here. To account 
for multiple testing, control the false discovery rate, account for correlation of 
the variables (genes) and relieve the normality assumption underlying the 
statistics, a standard procedure using a step-down algorithm was applied (Westfall 
& Young, 1993, Algorithm 4.1). Usually the data is permuted array-wise p times. The 
t- (or B-) statistics are also calculated for this randomized data. The randomized 
value is then compared to the value from the non-permuted experiment. The adjusted 
p-value (APV) is the fraction of t values from the populations not exceeding the 
values from the randomized data. The adjusted p-values model the error probability 
that a gene is regarded to be differentially expressed. In the following a 
threshold for APV of 0.05 is used on p = 500 permutations for all algorithms. 



4 The PUL Method to Identify DE Genes 

Here we describe the PUL method to identify differentially expressed (DE) genes. 
PUL stands for “empirical Bayes probabilities of unit transformed log normal 
microarray data”. There are four main ideas in this method: first, unit 
transformation; second, mixture modeling using a mixture of one Gaussian and two 
log-normal distributions; third, calculation of Bayes posterior probabilities; and 
finally, scoring for DE genes based on averaged probabilities of membership in the 
sets of under- or overexpressed genes. 



4.1 Unit Transformation 

In order to compare the measurements of different microarrays (slides) the ranges 
of the measurements must be adapted. Often a z-transformation is used for this. The 
result of a z-transformation is a mean of zero and unit variance for each array. With 
this transformation it is, however, not guaranteed that the distributions of different 
arrays are properly aligned. 

Figure 1 pictures the distribution of the z-transformed measurements for some 
typical arrays. The probability density functions of the distributions are shown using 
the PDE method described in Ultsch (2003). It can be seen that the arrays are not 
centered at zero and that the variances are not identical. The reason for this is 
that the extreme values, i.e., over- or under-expressed genes, distort the 
calculation of means and variances. Therefore a different procedure for the 
alignment of the ranges of different microarrays is proposed. In most microarray 
experiments the majority of genes are not influenced by the experimental 
conditions. Most measurements can therefore be attributed to some unspecific 
binding (USB) effects. Genes that are neither over- nor underexpressed (i.e., 
unexpressed genes) randomly bind to the microarray spots. If there is no systematic 
error in the experimental setup, this distribution is a Gaussian 
$N(m_{USB}, s_{USB})$. The parameters $m_{USB}$ and $s_{USB}$ are estimated by the 
Expectation Maximization (EM) procedure. For EM see, for example, Bilmes (1997). 
Robust estimations for mean and s.d. are used as a starting point for the EM 
optimization steps. 

Fig. 1 Probability distributions of some z-transformed arrays 

Fig. 2 Unit transformed data for the same data as in Fig. 1 

The unit transformation, abbreviated to u-transformation (Ultsch, 2005), is a 
variant of the z-transformation. It standardizes the values $x$ such that the 
distribution of the unspecific binding effects (USB) is transformed to a standard 
normal distribution (N(0,1)): 

$$u = \frac{x - m_{USB}}{s_{USB}}.$$

The u-transformation has the effect that the Gaussian resulting from unexpressed 
genes forms a standard normal (N(0,1)) distribution. Figure 2 shows the result 
of this u-transformation on the same data as in Fig. 1. The dashed line in Fig. 2 is a 
standard normal distribution (N(0,1)). 
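A minimal sketch of the u-transformation follows. The paper estimates the USB parameters by EM; here, as a labeled stand-in, the median and the scaled median absolute deviation are used, which correspond only to the robust starting values mentioned in the text. The function names and the toy measurements are assumptions.

```python
import statistics


def robust_location_scale(values):
    """Median and scaled MAD as robust stand-ins for the USB mean and s.d."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 makes the MAD consistent with the s.d. of a normal distribution.
    return med, 1.4826 * mad


def u_transform(values, m_usb, s_usb):
    """Shift and scale so that the USB distribution becomes N(0, 1)."""
    return [(v - m_usb) / s_usb for v in values]


array = [1.0, 2.0, 3.0, 4.0, 100.0]   # toy array with one extreme value
m_usb, s_usb = robust_location_scale(array)
u_values = u_transform(array, m_usb, s_usb)
```

Unlike the mean and s.d. used by a z-transformation, these estimates are hardly moved by the single extreme value, which is the motivation given above for replacing the z-transformation.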

Note that EM is an iterative optimization procedure and may converge to a local 
optimum. The results of the EM algorithm therefore need to be verified. PDE plots 
as used in Figs. 1 and 2 and also QQ-plots are suitable tools for this cross-checking 
(see Fig. 4). 



4.2 Modeling Expressed Genes as Log Normals 

Microarray data results from a mixture of three types of genes: unexpressed genes, 
underexpressed genes and overexpressed genes. The distribution of unexpressed 
genes is a Gaussian, if no systematic measurement error is present. For 
u-transformed data this Gaussian is N(0,1). For PUL the under- and overexpressed 
genes are modeled by log-normal distributions. The distributions are fitted to 
the data using EM. Figure 3 shows an example of the u-transformed data and the 
log-Gauss-log mixture model. 

Fig. 3 Mixture model optimized by EM, and compared to the empirical distribution 

Fig. 4 QQ-plot of the data vs. the mixture model 

The quality of the mixture model can be estimated by a quantile/quantile plot 
(QQ-plot). The percentiles of the empirical distribution are compared to the 
percentiles of the mixture model. As can be seen in Fig. 4 the QQ-plot forms a 
straight line. This indicates that the log-Gauss-log mixture model is appropriate. 



With the distributions calculated as described in the last subsection, the probability 
of a gene g belonging to one of the three sets (unexpressed, underexpressed and 
overexpressed) can be calculated using Bayes' theorem. The PUL value of a 
gene g is defined using these posterior probabilities: 

$$\mathrm{PULvalue}(g) = p(g \mid \text{overexpressed}) - p(g \mid \text{underexpressed}),$$

where p(g | overexpressed) is the posterior probability calculated using Bayes' 
theorem on the model developed in the previous subsection, and analogously for 
p(g | underexpressed). The maximum PULvalue is 1, meaning that a measurement is 
from an overexpressed gene. The minimum PULvalue is −1, meaning that a measurement 
is from an underexpressed gene. PULvalues of 0 indicate that a measurement is from 
an unexpressed gene. This is particularly well suited to calculate differences 
among the different experimental populations. A difference of zero means no 
differential expression. An absolute difference around 1 means that genes are over- 
or underexpressed in one population and unexpressed in the other. The maximum 
absolute difference is 2. This indicates a change from over- to underexpression or 
vice versa. 
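The posterior calculation can be sketched for a single u-transformed measurement. The sketch assumes a three-component mixture with a standard normal for unexpressed genes, a log-normal on the positive axis for overexpressed genes and a mirrored log-normal on the negative axis for underexpressed genes. The mirroring, the mixture weights and the log-normal parameters are assumptions for illustration; in the paper the components are fitted by EM.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution; zero for non-positive x."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


def pul_value(x, weights, over=(1.0, 0.5), under=(1.0, 0.5)):
    """Posterior of 'overexpressed' minus posterior of 'underexpressed'."""
    w_under, w_unexp, w_over = weights
    lik_under = w_under * lognormal_pdf(-x, *under)  # mirrored component
    lik_unexp = w_unexp * normal_pdf(x)
    lik_over = w_over * lognormal_pdf(x, *over)
    total = lik_under + lik_unexp + lik_over        # Bayes denominator
    return (lik_over - lik_under) / total


weights = (0.1, 0.8, 0.1)              # hypothetical mixture weights
centre = pul_value(0.0, weights)       # clearly unexpressed
high = pul_value(8.0, weights)         # far in the upper tail
low = pul_value(-8.0, weights)         # far in the lower tail
```

Values close to 0, 1 and −1 correspond to the three cases described above, and intermediate values express the uncertainty of the assignment.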



4.4 Gene Scoring in PUL 



The gene list produced by PUL is sorted according to the PULscore, which is 
calculated as follows: for a gene g let $m_1$ be a robust estimation of the mean 
PULvalue and $s_1$ be a robust estimation of the s.d. of the PULvalues in 
population 1. For population 2, $m_2$ and $s_2$ are obtained analogously. The 
relative sizes of the populations are $w_1$ and $w_2$. The PULscore of gene g is 
then: 

$$\mathrm{PULscore}(g) = \frac{|m_1(g) - m_2(g)|}{(w_1 s_1(g) + w_2 s_2(g)) + 1}.$$

PULscore is comparable to t-statistics. Instead of the differences in mean of the 
data, the difference in posterior probabilities (PULvalues) is used. As mentioned 
above, for definitive un-, under- or overexpressed genes these PULvalues are 0, −1 
and 1. The denominator for the differences in mean is basically the average s.d. in 
the different populations plus the difference from unexpressed to expressed genes 
(= 1). If the sum of s.d. in the populations is zero, then the PULscore is the absolute 
value of the differences in means. 
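The score can be written down directly from the formula. The text does not say which robust estimators are used, so the sketch below assumes the median and the scaled median absolute deviation; these choices and the function names are illustrative only.

```python
import statistics


def robust_mean_sd(values):
    """Median and scaled MAD (assumed robust estimators of mean and s.d.)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, 1.4826 * mad


def pul_score(values_pop1, values_pop2):
    """PULscore of one gene from its PULvalues in the two populations."""
    m1, s1 = robust_mean_sd(values_pop1)
    m2, s2 = robust_mean_sd(values_pop2)
    n1, n2 = len(values_pop1), len(values_pop2)
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)   # relative population sizes
    return abs(m1 - m2) / ((w1 * s1 + w2 * s2) + 1.0)


# 18 arrays in population 1 and 15 in population 2, as in the NBD data.
switching = pul_score([1.0] * 18, [-1.0] * 15)   # over- to underexpression
unchanged = pul_score([0.0] * 18, [0.0] * 15)    # unexpressed in both
```

The first case reproduces the maximum absolute difference of 2 with zero spread, so the score equals the absolute difference in means, as stated in the text.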



5 IR Methods for the Evaluation of DE Algorithms 

All the procedures described above for the identification of differentially expressed 
genes (ZT, SAM, PAM, PUL) produce lists of genes sorted according to a score. 
High values of the score should indicate that there is a difference in gene 
expression in the two experimental populations, e.g., tumor stage 1 vs. tumor stage 
4 in the NBD data set. Low values should indicate that a gene is not differentially 
expressed. Here an information retrieval (IR) approach is followed to measure the 
quality of the different algorithms. The gene lists produced by the different 
procedures can be regarded as results of a query for relevant genes in the set of 
all genes. In the benchmark data set the relevant genes are the 96 definitive 
differentially expressed genes. All other genes are irrelevant; some of them, 
however, randomly exhibit some differences in expression (borderliners). For a gene 
list of length s, the information retrieval measurements of recall and precision can 
be calculated. Recall is the fraction of all DE genes that appear in the list. 
Precision is the fraction of the genes in the list that are differentially 
expressed. A recall of 100% can be easily achieved: just take a list of all genes. 
Therefore recall alone is not enough. A perfect algorithm has both precision and 
recall of 100%. The advantage of this IR approach is that the performance for all 
list lengths s can be compared with a precision/recall graph (see Figs. 5 and 6). 

Fig. 5 Recall and precision as a function of gene list length s for different algorithms to find DE 
genes 
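Recall and precision for a list cut at length s can be computed as in the following sketch. The gene identifiers and the function name are made up for illustration.

```python
def recall_precision(ranked_genes, relevant, s):
    """Recall and precision of the first s genes of a ranked gene list."""
    hits = sum(1 for gene in ranked_genes[:s] if gene in relevant)
    recall = hits / len(relevant)   # share of all relevant genes retrieved
    precision = hits / s            # share of the list that is relevant
    return recall, precision


ranked = ["g1", "g2", "g3", "g4"]   # toy gene list, best score first
relevant = {"g1", "g3"}             # toy set of definitive DE genes
short_list = recall_precision(ranked, relevant, 2)
full_list = recall_precision(ranked, relevant, 4)
```

Both measures share the same number of hits, so for a fixed number of relevant genes precision equals recall times the number of relevant genes divided by s. With 96 relevant genes and s = 192, for example, a recall of 65% corresponds to a precision of about 33%.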



6 Results 

Figure 5 shows the precision/recall graph for the methods ZT, B-statistics, SAM,
PAM and PUL. Note that the best performance is in the upper right corner and the
worst in the lower left corner. The points with the correct number of relevant genes
$s_{\mathrm{true}}$ (= 96), $2 s_{\mathrm{true}}$ (= 192) and s = 10% of all the genes (= 448) are
marked with a circle in Fig. 5. Interestingly, there is a clear hierarchy in the
performance of the algorithms. PAM has the weakest performance. The simple
t-statistics (ZT) deliver about 25% more in both Precision and Recall. Next is
B-statistics, with an increase of about 10% in Precision and Recall compared to ZT.
SAM achieves about the same increase in performance compared to B-statistics. PUL
outperforms SAM by a factor of 2, corresponding to a 40% increase in Precision and
Recall. Note that the results reported here are independent of the correct estimation
of a cut-off value for differentially expressed genes.

If the number of differentially expressed genes is estimated correctly
($s_{\mathrm{true}} = 96$), PUL has a recall and precision in the 80-90% range, whereas SAM
retrieves about 50% correct genes and 50% irrelevant genes. If the true number of
relevant genes is overestimated by a factor of 2 ($s = 192$), PUL finds more than 90%
of the differentially expressed genes with a precision of about 50%. SAM, as the
second best algorithm, retrieves only 65% of the differentially expressed genes with
a precision as low as 33%. PUL identifies the first 80% of the differentially
expressed genes right away. The same Recall is obtained by the other algorithms only
with a far larger number of genes ($s > 300$). This implies that a large number of
false positives are found. If the number of genes under consideration is rather
large, for example 10% of the genes ($s = 448$), PUL finds practically all
differentially expressed genes. In this case SAM retrieves about 90% of the relevant
genes, ZT retrieves less than 70% and PAM less than 30%.

In order to control the false discovery rate (FDR), adjusted p-values (APV)
can be calculated for all algorithms. In Fig. 6 adjusted p-values (APV) are shown
as the result of 500 permutations of the arrays. The cutoff value for FDR is set at
0.05. Practically the same hierarchy in performance as in the previous figure can
be observed: PUL ≫ SAM > B-statistics > ZT ≫ PAM. ZT outperforms
B-statistics if the right number of relevant genes is overestimated.

Fig. 6 Precision/recall graph with adjusted p-values for false discovery rate (FDR)
control (adjusted p-values < 5%, 500 permutations)
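How adjusted p-values may be obtained from permutations of the arrays can be sketched as follows. The text does not state which test statistic or which adjustment procedure was used; the absolute difference in group means and the Benjamini-Hochberg adjustment below are therefore assumptions made for illustration, as are all function names and the toy data.

```python
import numpy as np

def permutation_pvalues(x, y, n_perm=500, seed=0):
    """Two-sided permutation p-value per gene for the difference in group means.

    x, y: arrays of shape (genes, arrays_in_group) for the two populations.
    """
    rng = np.random.default_rng(seed)
    data = np.hstack([x, y])
    n_x = x.shape[1]
    observed = np.abs(x.mean(axis=1) - y.mean(axis=1))
    exceed = np.zeros(data.shape[0])
    for _ in range(n_perm):
        perm = rng.permutation(data.shape[1])        # shuffle the array labels
        px, py = data[:, perm[:n_x]], data[:, perm[n_x:]]
        exceed += np.abs(px.mean(axis=1) - py.mean(axis=1)) >= observed
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment procedure)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

# Hypothetical toy data: 2 genes, 6 arrays per group; only gene 0 differs.
x = np.zeros((2, 6))
x[0] = 10.0
y = np.zeros((2, 6))
adjusted = benjamini_hochberg(permutation_pvalues(x, y))
```

Genes whose adjusted p-value falls below the chosen cutoff (0.05 in the text) would then be reported as differentially expressed.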



7 Applications of PUL 

The PUL method has been applied to two different real experiments with data
from different sorts of microarray technologies. One application is the prediction
of survival in neuroblastoma tumors. As described by Gebhard et al. (2009),
differentially expressed genes between tumor stage 1 and tumor stage 4 were
identified by PUL. The result was a set of 18 DE genes. These 18 genes significantly
discriminated a favorable from an unfavorable prognosis in stage 3 neuroblastoma.
Stage 3 is clinically characterized by an intermediate prognostic outcome. In these
experiments two-color spotted cDNA microarrays were used. The 18 genes identified by
PUL could be confirmed to be useful for the prognosis of the tumor. This leads to
different treatment options for stage 3 patients. In particular, unnecessarily
aggressive therapies can be avoided.

PUL was also applied to bead microarray data. In Pallasch, Schwamb, Schulz,
Konigs, Debey, et al. (2008) the results of an experiment for the diagnosis of chronic
lymphatic leukemia (CLL) are reported. B-cell receptor (BCR) stimulation of CLL
cells was compared to that of CD5+ normal B-cells. PUL analysis showed that
lipoprotein lipase (LPL) is induced by BCR stimulation. Application of a lipase
inhibitor resulted in the induction of apoptosis in primary CLL B-cells, while no
apoptosis was induced in healthy populations. The key role of the lipid metabolism
in CLL pathogenesis was found by PUL. Independent in-vitro data confirmed the
potential for a therapeutic use of a lipase inhibitor. This means that the PUL method
was also used successfully for a different type of microarray on a different problem.
The hypotheses suggested by PUL could be verified independently in both cases.



8 Discussion 

For the benchmark data set NBD all algorithms agree on the primary 16 genes to
be definitive DE. This clear distinction between the two populations was
successively reduced on the five replicates of these genes. Furthermore, 645 genes
with borderline DE were included in NBD. As a result, there might be borderline
genes in NBD with differential properties that surpass those of the 96 presumably
DE genes. Therefore it may be impossible to obtain Precision and Recall values of
100% with any algorithm. On the other hand, NBD is hardly influenced by a particular
method to calculate statistics for DE genes. This allows a rather fair comparison
of the algorithms.

The comparison of data from different experiments is a necessary prerequisite for
population-wide measurements like averages or variances. Adjusting the data
to overall measures like total means and total variance is a problem for leptokurtic
data. The commonly used z-transformation fails to align the data correctly (see
Fig. 1). The u-transformation ensures that the distribution of the majority of
unexpressed genes (USB) is projected to a standard normal distribution (see Fig. 2).
For u-transformed data it is feasible to find a suitable mixture model. The main part
of the mixture, i.e., the unexpressed genes, is defined. A log normal model for the
over- and underexpressed genes seems to be appropriate, as the QQ-plot shows (see
Fig. 4). The alternative would be to use Gaussians for the over- and underexpressed
distributions. If the variances of such Gaussians are chosen large enough to account
for the large data values, they strongly influence the USB distribution. Small
variances of such Gaussian models would avoid this effect. This, however, does not
model the large values in the data. Log models, on the other hand, are asymmetric
such that they contribute zero probability in the central USB distribution (see
Fig. 3).

Microarray experiments often suffer from the problem of very few cases in
the populations. The distribution of the measurements makes it difficult to properly
estimate means and variances for a single gene. The scoring for the gene lists,
however, is critically dependent on good estimates of these parameters. The key
problem here is that the absolute measurements for over- or underexpressed data may
be arbitrarily large. The transition from data to posterior probabilities projects
the data to the interval [-1, 1] (see Ultsch, 2007). If there are errors in a
measurement, these errors have only limited influence on PUL’s scores. For the
t-statistics and B-statistics, however, the measurement errors may dominate.
Measurements with large absolute values bias the t- and B-statistics towards an
overestimation of a gene’s relevance. The calculation of adjusted p-values is also
biased by such measurements. The results of the comparison show a performance
ranking independent of the number of genes considered to be differentially expressed.

The authors of PAM claim that it delivers more appropriate results than
SAM (see PAM’s webpage). This claim can be confirmed only for the first 2% of
genes in the list. For a gene list length greater than or equal to the right number
of genes sought, PAM performed considerably worse than the other algorithms.

The essential parameters for PUL are the parameters of the mixture model. These 
parameters are estimated from all the measurements on an array. A quality control 
of these estimations is necessary. This can be obtained using probability density 
plots of the model vs. the data (see Fig. 3) with the PDF method described in 
Ultsch (2003) and/or QQ-plots (see Fig. 4). 

All methods to find differentially expressed genes are based on theoretical
assumptions that are usually not fulfilled for experimental data. All methods are
therefore just heuristics to point out interesting hypotheses which should be
confirmed with independent experiments.

In this paper a systematic comparison of the performance of the algorithms was
undertaken. A benchmark data set with known properties allows the application
of established methods for the assessment of quality. The newly proposed algorithm
PUL was shown to outperform other algorithms by a substantial factor. In
two practical applications with different types of microarrays (spotted and bead),
results with external confirmation could be obtained. In the case of the
neuroblastoma data, the survival prognosis was successful. In the CLL case PUL
pointed out a regulatory pathway which could be confirmed in in-vitro experiments.



9 Summary 

A novel algorithm is proposed for the identification of differentially expressed genes 
in two group microarray experiments. The algorithm, called PUL, is compared 
to other popular algorithms using published implementations. The comparison is 
based on measurements used in information retrieval (Recall and Precision). The 
advantage of this approach is that the gene lists produced by the algorithms can be 
compared for all list lengths. Surprisingly, a clear ordering in performance of the 
algorithms was observed: 

PAM ≪ ZT < B-statistics < SAM ≪ PUL. The same ordering was obtained when
adjusted p-values were used for the improvement of False Discovery Rates. PUL
outperformed the other algorithms by a factor of two. PUL was applied successfully in
different practical applications. For these experiments the importance of the genes
found by PUL was independently verified.


Data Compression and Regression Based
on Local Principal Curves



Jochen Einbeck, Ludger Evers, and Kirsty Hinchliff 



Abstract Frequently the predictor space of a multivariate regression problem of
the type $y = m(x_1, \ldots, x_p) + \varepsilon$ is intrinsically one-dimensional, or at
least of far lower dimension than $p$. Usual modeling attempts such as the additive
model $y = m_1(x_1) + \cdots + m_p(x_p) + \varepsilon$, which try to reduce the complexity
of the regression problem by making additional structural assumptions, are then
inefficient as they ignore the inherent structure of the predictor space and involve
complicated model and variable selection stages. In a fundamentally different
approach, one may consider first approximating the predictor space by a (usually
nonlinear) curve passing through it, and then regressing the response only against
the one-dimensional projections onto this curve. This entails the reduction from a
$p$- to a one-dimensional regression problem.

As a tool for the compression of the predictor space we apply local principal 
curves. Taking things on from the results presented in Einbeck et al. (Classification - 
The Ubiquitous Challenge. Springer, Heidelberg, 2005, pp. 256-263), we show how 
local principal curves can be parametrized and how the projections are obtained. 
The regression step can then be carried out using any nonparametric smoother. We 
illustrate the technique using data from the physical sciences. 

Keywords Dimension reduction • Principal component regression • Principal
curves • Smoothing.



1 Introduction 



Principal curves are “smooth one-dimensional curves passing through the middle 
of a p-dimensional data set, providing a nonlinear summary of the data” (Hastie
& Stuetzle, 1989). Since Hastie and Stuetzle’s pioneering work, principal curves 
have been further investigated, applied, and developed by quite a few researchers. 



Jochen Einbeck (✉)

Department of Mathematical Sciences, Durham University, Durham, UK,
e-mail: jochen.einbeck@durham.ac.uk



A. Fink et al. (eds.), Advances in Data Analysis, Data Handling and Business
Intelligence, Studies in Classification, Data Analysis, and Knowledge Organization,
DOI 10.1007/978-3-642-01044-6_64, © Springer-Verlag Berlin Heidelberg 2010






Today, at least half a dozen algorithms exist for estimating them. These differ
essentially in (1) what is understood by the “middle” of the data cloud; (2) the
algorithmic family (“top-down” or “bottom-up”); (3) the criterion used for
minimizing the error (if used at all).

Among the various principal curve concepts proposed are bias-corrected versions 
of the HS algorithm (Banfield & Raftery, 1992; Chang & Ghosh, 1998), the polygo- 
nal line algorithm (Kegl, Krzyzak, Linder, & Zeger, 2000), the “principal curves of 
orientated points” (PCOPs, Delicado, 2001), and the “local principal curves” (LPCs, 
Einbeck, Tutz, & Evers, 2005b). PCOPs and LPCs are bottom-up algorithms, i.e., 
they proceed through the data cloud step by step and do not minimize a global error 
criterion. All other existing methods correspond to top-down algorithms, meaning
that they start with some initial line which is then iteratively deformed until it
fits satisfactorily through the data cloud and some global error criterion is minimized.
Apart from the LPCs, which aim to approximate the density ridge, all concepts 
assume the existence of some theoretical “true” principal curve. Implementations of 
all algorithms mentioned above are publicly available and have been applied to a 
wide range of problems, including the recognition of hand-written characters (Kegl 
& Krzyzak, 2002), the reconstruction of river outlines or coastlines (Einbeck, Tutz, 
& Evers, 2005a, 2005b), and path estimation from GPS tracks (Brunsdon, 2007). 

Surprisingly, the existing literature seems to be happy with knowing that princi- 
pal curves can be estimated and that the resulting curve can be visualized, but has 
not proceeded with exploiting its benefits once it is there (with the notable exception 
of Chang & Ghosh, 1998, who make use of HS principal curves for further pairwise 
compression of principal component scores). The value of their parametric
counterpart, principal components, likewise comes to bear only when they are used
for data compression or regression (e.g., Hastie, Tibshirani, & Friedman, 2001, p. 66).

In Sect. 2, we consider a simple example taken from traffic engineering, illustrat- 
ing how principal curves may be used for data compression and decompression. To 
motivate the necessity and value of nonparametric dimension reduction techniques, 
we proceed in Sect. 3 to a more complex application involving high-dimensional 
data from the future Galactic survey mission GAIA, and show how principal curves 
can be used for dimension reduction in multiple regression problems. In both cases, 
the technique used is that of local principal curves. We finish with a brief outlook 
on the extension to principal manifolds in Sect. 4. 



2 Data Compression with Local Principal Curves 
2.1 Local Principal Curves 

Assume we are given a data set $X_1, \ldots, X_n$, with $X_i \in \mathbb{R}^p$, the
intrinsic structure of which is to be described. Local principal curves (Einbeck et
al., 2005a, 2005b) are based on the idea that, at each point $x \in \mathbb{R}^p$ along a
principal curve, the localized first principal component line forms approximately a
tangent to the curve. They can be seen as a simple and fast approximation to the
mathematically and computationally more demanding PCOPs (Delicado, 2001). Beginning
at some starting point $x = x_0$, LPCs work successively through the data cloud,
alternating between the following two steps:

(1) Calculate a localized center of mass $\mu^x = \sum_{i=1}^n w_i X_i$, where
$w_i = K_H(X_i - x) / \sum_{l=1}^n K_H(X_l - x)$.

(2) Compute the first local eigenvector $\gamma^x$ of $\Sigma^x = (\sigma^x_{jk})_{1 \le j,k \le p}$,
where $\sigma^x_{jk} = \sum_{i=1}^n w_i (X_{ij} - \mu^x_j)(X_{ik} - \mu^x_k)$ and $\mu^x_j$
denotes the $j$-th component of $\mu^x$. Using a predetermined step size $t_0$, step
from $\mu^x$ to $x := \mu^x + t_0 \gamma^x$.

The sequence of the local centers of mass $\mu^x$ makes up the local principal curve.

Here, $K_H(\cdot) = |H|^{-1/2} K(H^{-1/2}\,\cdot)$ with a multivariate kernel $K$ and a
positive definite bandwidth matrix $H = \mathrm{diag}(h_1^2, \ldots, h_p^2)$. Extensions to
disconnected and branched curves were considered in Einbeck et al. (2005b) and
Einbeck et al. (2005a), respectively, and are easily implemented by using suitable
multiple starting points. Crossings can be handled conveniently using an angle
penalization (Einbeck et al., 2005b). As in each iteration only points in the local
neighborhood are considered, the algorithm is quite flexible, and, at the same time,
robust to outlying data patterns.
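The alternation between steps (1) and (2) can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the description above: a Gaussian product kernel is used, the curve is followed in one direction only for a fixed number of steps (no stopping rule, no angle penalization), and the sign of each eigenvector is aligned with the previous one so that the walk does not turn back. The function name and the toy data are hypothetical.

```python
import numpy as np

def local_principal_curve(X, x0, h, t0, n_steps=50):
    """One-directional local principal curve (sketch).

    X: (n, p) data, x0: (p,) starting point, h: (p,) bandwidths, t0: step size.
    Returns the (n_steps, p) array of local centers of mass.
    """
    x = np.asarray(x0, dtype=float)
    centers, prev_gamma = [], None
    for _ in range(n_steps):
        # Step (1): kernel weights and localized center of mass.
        # The normalizing constant of the kernel cancels in the weights.
        w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
        w /= w.sum()
        mu = w @ X
        # Step (2): first eigenvector of the localized covariance matrix.
        D = X - mu
        sigma = (w[:, None] * D).T @ D
        gamma = np.linalg.eigh(sigma)[1][:, -1]   # eigh sorts eigenvalues ascending
        # Eigenvectors have an arbitrary sign: keep walking in the same direction.
        if prev_gamma is not None and gamma @ prev_gamma < 0:
            gamma = -gamma
        centers.append(mu)
        prev_gamma = gamma
        x = mu + t0 * gamma
    return np.array(centers)

# Hypothetical toy data: a noisy half circle.
rng = np.random.default_rng(0)
angle = rng.uniform(0, np.pi, 300)
X = np.column_stack([np.cos(angle), np.sin(angle)])
X += rng.normal(scale=0.05, size=X.shape)
curve = local_principal_curve(X, x0=X[0], h=np.array([0.2, 0.2]), t0=0.2, n_steps=10)
```

Since each center of mass is a weighted average of the data with non-negative weights, the fitted points always stay within the range of the data, which reflects the robustness to outlying patterns mentioned above.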



2.2 Simple Example: Speed-Flow Data 

Figure 1 displays data recorded on the Californian freeway ER57-N on 9 July 2007.
Each dot corresponds to the average of speed and flow values aggregated over
5-minute intervals. An LPC is fitted, using a common value for the parameters
$h_1 = h_2 = t_0$, and a starting point selected at random from the original data.
The resulting points are symbolized by black squares in Fig. 1.

How does one go about connecting the points? For descriptive purposes a linear 
interpolation is sufficient, as it was handled in the original references (Einbeck et al.,
2005a, 2005b). However, if the curve is to be used for further processing, it would
need to be fully parametrized. One way of achieving this is to use a cubic spline (a 
piecewise polynomial function constructed from third order polynomials), yielding 
a continuous and differentiable smooth curve, as outlined below. 



2.3 Parametrizations and Projections 

For a fitted LPC consisting of $L$ local centers of mass
$\mu^\ell = (\mu^\ell_1, \ldots, \mu^\ell_p)^T$, $\ell = 1, \ldots, L$, we seek a
parametrization $t$ such that the curve can be written as a function
$f: t \mapsto f(t) \in \mathbb{R}^p$ attaining the $L$ points $\mu^\ell$ as outputs for
certain parameter values $t$. Firstly, one end point is chosen to be the origin
corresponding to $t = 0$. This is an arbitrary choice, and by convention $t$ increases
along the curve away from this end point. Technically, the curve is parametrized in
three steps:

(1) Compute a discrete, preliminary parametrization $(s^\ell)_{1 \le \ell \le L}$, with the
same origin as $t$, by adding up Euclidean distances between subsequent $\mu^\ell$:
$s^\ell = \sum_{k=2}^{\ell} \|\mu^k - \mu^{k-1}\|$.

(2) For each $j = 1, \ldots, p$ lay a cubic spline through the set of points
$(s^\ell, \mu^\ell_j)_{1 \le \ell \le L}$, yielding graphs $(s, \mu_j(s))$. Putting them together,
one obtains a continuous and differentiable spline function
$(\mu_1, \ldots, \mu_p)^T(s)$.

(3) Recalculate the parameter through the arc length of this spline function:
$t = \int_0^s \sqrt{\sum_{j=1}^p (\mu_j'(u))^2}\, du$.

It should be noted that no smoothing is involved in (2) - this is a purely mechanical
step interpolating the $\mu^\ell$ through a string of cubic polynomials.

Once this parametrization is established, each data point $X_i$, $i = 1, \ldots, n$,
can be projected onto the point of the curve nearest to it (in terms of Euclidean
distances), yielding the projection index $t_i$. Data can be decompressed by evaluating
the principal curve $f$, represented through the $p$-dimensional spline function, at
$t_i$.

Fig. 1 Speed-flow data (pluses) and principal curve (solid curve) with local centers
of mass (filled squares)
An illustration is given in Fig. 2. Note that, though the parametrization is
unit-speed (i.e., distances in parameter space correspond to distances in data space
along the principal curve), the projections are not topology-preserving: data points
which are neighboring in data space are not necessarily neighboring in parameter
space. This is a general property of data compression through principal curves, which
distinguishes such methods from topology-preserving, but less interpretable mappings
(Pena, Barbakh, & Fyfe, 2008).

Fig. 2 Speed-flow data with principal curve (solid) and projections (dashed lines);
axis label: flow (veh/2.5min)
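The three parametrization steps and the subsequent projection can be sketched as follows. Several choices here are assumptions made for illustration only: a natural cubic spline (zero second derivative at both ends) is used as the interpolating spline, the arc length is approximated by summing short chords on a fine grid instead of evaluating the integral exactly, and the projection is restricted to the grid points. All function names are hypothetical, and subsequent centers of mass are assumed to be distinct.

```python
import numpy as np

def natural_cubic_spline(s, y):
    """Return an evaluator for the natural cubic spline through (s[l], y[l])."""
    s, y = np.asarray(s, float), np.asarray(y, float)
    L, h = len(s), np.diff(s)
    A, rhs = np.eye(L), np.zeros_like(y)
    for i in range(1, L - 1):
        A[i, i - 1:i + 2] = h[i - 1], 2 * (h[i - 1] + h[i]), h[i]
        rhs[i] = 6 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    M = np.linalg.solve(A, rhs)            # second derivatives at the knots

    def evaluate(u):
        u = np.atleast_1d(u)
        i = np.clip(np.searchsorted(s, u, side="right") - 1, 0, L - 2)
        a = ((s[i + 1] - u) / h[i])[:, None]
        b = 1 - a
        return (a * y[i] + b * y[i + 1]
                + ((a**3 - a) * M[i] + (b**3 - b) * M[i + 1])
                * (h[i] ** 2)[:, None] / 6)
    return evaluate

def parametrize(centers, n_grid=2000):
    """Steps (1)-(3): chord-length knots, spline, arc-length parameter."""
    centers = np.asarray(centers, float)
    # (1) preliminary parameter: cumulative Euclidean distances
    steps = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(steps)])
    # (2) cubic spline through the centers, all coordinates at once
    spline = natural_cubic_spline(s, centers)
    # (3) arc length, approximated by short chords on a fine grid
    grid = spline(np.linspace(0.0, s[-1], n_grid))
    chords = np.linalg.norm(np.diff(grid, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(chords)])
    return t, grid

def project(X, t, grid):
    """Projection index of each row of X: arc length of the nearest curve point."""
    d2 = ((np.asarray(X, float)[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    return t[np.argmin(d2, axis=1)]

# Hypothetical centers of mass on a quarter circle, and two points to project.
a = np.linspace(0, np.pi / 2, 8)
centers = np.column_stack([np.cos(a), np.sin(a)])
t, grid = parametrize(centers)
t_i = project(np.array([[1.1, 0.0], [0.0, 0.9]]), t, grid)
```

Decompression then amounts to looking up the curve point that belongs to a projection index, here the grid point with the matching value of `t`.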



3 Regression with Principal Curves 
3.1 GAIA Data 

GAIA is an astrophysics mission of the European Space Agency (ESA). A satellite
is to be launched in 2011 which will undertake a detailed survey of over $10^9$ stars
in our Galaxy and extragalactic objects. The aims of the mission are, among others,
to classify objects into stars, galaxies, quasars, etc., and to determine
astrophysical parameters (“APs”: temperature, metallicity, gravity) from
spectroscopic data (photon counts at certain wavelengths) (Bailer-Jones, 2002). As
yet, one has to work with simulated data generated through complex computer models.
Figure 3 gives an example of a set of n = 8,286 16-dimensional photon counts
simulated from APs through computer models.

Note that, for the actual estimation problem, the photon counts form the predictor
space and the APs form the response space; this is opposite to the direction of
simulation. As a consequence, the regression problem may be degenerate (i.e., one
set of photon counts may be associated with two different APs). In the following, we
will focus on the temperature, which features the least amount of degeneracy. We
use a sample of size n' = 1,000 from the original data for all following
calculations. Fitting a multiple linear regression model for temperature against the
sixteen individual photon counts leads to a residual standard error of 1,978 on 983
degrees of freedom, with t-values for all variables around 0.65 and corresponding
p-values around 0.51. We conclude that this does not constitute a useful model for
the data.



3.2 Principal Component Regression 

The usual remedies in this case are model/variable selection procedures or dimension
reduction techniques. The latter are obviously the more promising here. A common
starting point for their application is the scree plot (Fig. 4), indicating that at
most three components (these explain 98.9% of the total variance) appear to be
sufficient to capture the information provided by these data. The usual way to
continue is then to regress y = temperature against the scores associated with the
largest three principal components, i.e.

$y = \beta_0 + \beta_1\,\mathrm{score}_1 + \beta_2\,\mathrm{score}_2 + \beta_3\,\mathrm{score}_3 + \varepsilon. \qquad (1)$

Fitting this trivariate linear regression problem leads to a residual standard error
of 2,060 on 996 degrees of freedom, with p-values < 2e-16 for all four regression
parameters. The residual standard error of this model is naturally larger than the
previous one, as the model is just an approximation of the full linear model based
on 98.9% of the available information. Nevertheless, this model is the more
appropriate one. The question remains whether the first three PC scores still feature
some inner structure which we could exploit.

Fig. 4 Scree plot for GAIA data

Fig. 5 Left: Scatterplot of first three principal component scores with local
principal curve (dashed line). The less intense the grey tone, the larger is the
temperature; right: temperatures fitted vs. projection indices
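Principal component regression as in (1) can be sketched as follows. The data below are synthetic and purely hypothetical (they are not the GAIA photon counts), and the function name is an assumption. Only the dimensions mirror the setting of the text: 1,000 observations, sixteen predictors, three components, and hence 996 residual degrees of freedom.

```python
import numpy as np

def pc_regression(X, y, n_components=3):
    """Least-squares regression of y on the leading principal component scores."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes; squared singular values give the variances
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (sv**2)[:n_components].sum() / (sv**2).sum()
    scores = Xc @ Vt[:n_components].T
    design = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = len(y) - design.shape[1]
    rse = np.sqrt(resid @ resid / dof)       # residual standard error
    return beta, rse, dof, explained

# Synthetic stand-in data: 16 predictors driven by 3 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 3))
X = latent @ rng.normal(size=(3, 16)) + 0.05 * rng.normal(size=(1000, 16))
y = 5000 + 1500 * latent[:, 0] + rng.normal(scale=200, size=1000)
beta, rse, dof, explained = pc_regression(X, y)
```

The proportion `explained` corresponds to the quantity read off the scree plot, and `beta` holds the four regression parameters of model (1).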



3.3 Dimension Reduction with Local Principal Curves 

To investigate this, we produce a three-dimensional scatterplot of the PC scores, and 
shade lower temperatures with darker grey tones (Fig. 5 left). Clearly there is some 
curvilinear inner structure, which is informative for the target variable, temperature. 
Hence, the following needs to be done:

(1) Fit a principal curve through the three-dimensional data cloud of PC scores. 

(2) Parametrize the principal curve and project all data points onto it. 

(3) Fit temperature (or other APs) against the (one-dimensional) projections. 

For task (1), an LPC is straightforwardly fitted¹ (Fig. 5 left). Alternatively, any
other principal curve algorithm which provides access to the parametrization and
allows for continuous projections could be used. This would include the HS algorithm,
as far as it copes with the complexity of the data in itself. Algorithms based
on piecewise line segments as in Kegl et al. (2000) are rather problematic for this
purpose as projections tend to be clustered around the knots, unless the procedure
outlined in Sect. 2.3 is additionally applied to them.

¹Using the default settings of the R function lpc for the parameters; these are:
$h_j = 1/10 \times$ (range of variable $j$), and $t_0 = (1/d) \sum_j h_j$.

We perform task (2) as described in Sect. 2.3 and plot temperature against the
projection indices. In (3), we are left with a simple one-dimensional nonparametric
regression problem $y_i = m(t_i) + \varepsilon_i$. We used penalized smoothing splines to
fit this model, but any nonparametric smoother could be used. The smooth fit is shown
in Fig. 5 (right).
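Since any nonparametric smoother may be used in step (3), the following sketch uses a Nadaraya-Watson kernel smoother rather than the penalized smoothing splines applied in the text, simply because it fits in a few lines. The function name, the bandwidth and the toy data are hypothetical.

```python
import numpy as np

def kernel_smoother(t, y, bandwidth):
    """Return a Nadaraya-Watson estimate m_hat of m in y_i = m(t_i) + eps_i."""
    t, y = np.asarray(t, float), np.asarray(y, float)

    def m_hat(t_new):
        u = (np.atleast_1d(t_new)[:, None] - t[None, :]) / bandwidth
        w = np.exp(-0.5 * u**2)               # Gaussian kernel weights
        return (w @ y) / w.sum(axis=1)
    return m_hat

# Hypothetical projection indices and responses.
rng = np.random.default_rng(2)
t = np.linspace(0, 10, 200)
y = np.sin(t) + rng.normal(scale=0.1, size=200)
m_hat = kernel_smoother(t, y, bandwidth=0.3)
y_new = m_hat(5.0)       # fitted value at the projection index t = 5
```

The returned function can be evaluated at the projection index of any observation, including one that was not used for fitting.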



3.4 Direct Local Principal Curve Regression 

One may wonder whether there is a shortcut to this. Instead of the two-stage strategy
“PC+LPC” used so far, one could consider fitting the local principal curve directly
through the n' × 16 dimensional photon counts, as shown in Fig. 6. Comparing
this result cursorily with Fig. 3, it appears that the data are reasonably represented
(for a more quantitative evaluation of the accuracy of a principal curve, a coverage
measure is available, see Einbeck et al., 2005b, and for the assessment of its
precision bootstrap methods may be applied, see Brunsdon, 2007). Indeed, the one-stage
strategy is feasible in principle, and the results for both strategies are quite
similar. However, as data get sparse in high dimensions, the LPC may miss remote parts
of the predictor space (the previously mentioned robustness may backfire here), which
then get inadequately projected. The consequence of this is an increased sensitivity
of the 16-dimensional LPC to the choice of the starting point compared to the
three-dimensional one. When approximating data through PCA in a first step, data are
far less sparse in the second. Principal components cannot miss isolated data points,
as PC lines can be thought of as being infinitely long.



3.5 Prediction and Comparison 

For a new observation $x_{\mathrm{new}}$ (i.e., here, a new set of spectra), prediction
proceeds as follows: (1) Project $x_{\mathrm{new}}$ onto the LPC (either in one or two
steps), giving $t_{\mathrm{new}}$. (2) Compute $\hat{y}_{\mathrm{new}} = \hat{m}(t_{\mathrm{new}})$
from the fitted nonparametric smoother (hereafter: NS).

Table 1 shows prediction errors for 200 observations each, sampled from the training
data set and from the remaining n - n' = 7,286 data points, respectively. Besides the
methods discussed so far, we include an additive model using PC scores (a model
just as in (1), but with all linear terms replaced by smooth functions; hereafter: AM).

As expected, and mentioned earlier, PC+LM is slightly worse than LM, and
obviously PC+AM is better than PC+LM. The three nonparametric approaches
clearly beat the parametric ones. The best median of squared residuals is achieved by
PC+LPC+NS, which is of a similar magnitude as that for LPC+NS and PC+AM.

Fig. 6 Pairwise plot of LPC fitted through 16-dimensional photon counts
Table 1 Prediction errors ($/10^3$) in comparison; $e_i$ is the difference between true
and predicted temperature (LM Linear model, PC Principal components, AM Additive
model, NS Nonparametric smoother)

                                    LM      PC+LM   PC+AM   PC+LPC+NS   LPC+NS
Training data   Average ($e_i^2$)   4,119   4,395   1,318   2,633       2,215
                Median ($e_i^2$)    1,035   1,300   123     51          66
Test data       Average ($e_i^2$)   6,393   6,743   2,054   5,695       4,667
                Median ($e_i^2$)    723     808     147     45          46

The mean of the squared residuals falls behind for the LPC-based methods compared
to PC+AM. This can be explained by the fact that points close to the “end” of the
data cloud are all projected onto the endpoint of the LPC, which leads to a degeneracy
at either $t = 0$ or $t = t_{\max}$ (or both). This is visible in Figs. 2 and 5 (right).
So, though the LPC-based methods work very well for the large bulk of the data, they
do not handle the few points close to the endpoints of the principal curve very well.
Artificially extrapolating the fitted LPC beyond its natural endpoints may help to
solve this problem.
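Why the average and the median of the squared residuals can rank methods differently is illustrated by the following sketch. The numbers are hypothetical and chosen only to mimic the situation described above, in which most predictions are accurate and a few points near the endpoints of the curve are predicted poorly; the scaling by $10^3$ follows Table 1.

```python
import numpy as np

def squared_error_summary(y_true, y_pred, scale=1e3):
    """Average and median of the squared residuals e_i^2, divided by `scale`."""
    e2 = (np.asarray(y_true, float) - np.asarray(y_pred, float)) ** 2
    return e2.mean() / scale, np.median(e2) / scale

# Hypothetical temperatures: five good predictions, two poor ones.
y_true = np.array([5000.0, 5200.0, 4800.0, 6100.0, 5500.0, 9000.0, 9500.0])
y_pred = np.array([5010.0, 5190.0, 4820.0, 6090.0, 5510.0, 7000.0, 7000.0])
avg, med = squared_error_summary(y_true, y_pred)
```

In this toy case the two poor predictions dominate the average, while the median reflects only the bulk of the data.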



4 Outlook 



Local principal curves are well suited to compress complex high-dimensional data 
structures, as long as the intrinsic dimensionality of the data cloud is close to one. 
When the intrinsic dimensionality is two or larger, the extension to local principal 
manifolds should be considered. In particular, the GAIA data may be better
approximated by a two-dimensional principal surface. This would be particularly
helpful for the prediction of other APs such as gravity or metallicity, information on
which tends to be orthogonal to the principal curve approximating the predictor space.
The work on
extending LPC methodology to higher-dimensional structures is currently ongoing, 
based on the idea of replacing the building block “localized principal component” 
by suitably orientated triangles or tetrahedrons. 


Optimization of Centrifugal Impeller Using
Evolutionary Strategies and Artificial Neural
Networks



Rene Meier and Franz Joos 



Abstract In order to optimize turbo machine components it is necessary to describe
the behaviour of multimodal objective functions (OF). To obtain the characteristics
of these OF, very time-consuming evaluations using a three-dimensional Navier-Stokes
solver have to be performed. In this study an Artificial Neural Network (ANN) is
considered as a performance predictor, with a view to replacing the evaluation of the
objective function and thereby speeding up the optimization process.

Keywords Artificial neural networks • Centrifugal impeller • Evolutionary
strategies • Resilient backpropagation.



1 Introduction 

In view of the decreasing resources of fossil fuels it is necessary to increase the
efficiency of turbo machines. Therefore present design procedures use Computational
Fluid Dynamics (CFD) to minimize development expenses. The complexity
of CFD problems has increased tremendously in terms of physics-based models,
multiscale mathematics, robust solvers as well as complex geometries applicable
to almost all application domains. Almost always the objective functions are
multimodal with a large number of variables. Optimization processes using flow
simulations are very time consuming because a huge number of evaluations are
needed. For optimization procedures of flow channels, evolutionary algorithms
(Weicker, 2002) are a good way to find the global optima with an appropriate
number of objective function evaluations. In this study a geometry optimization
of a centrifugal impeller with Evolutionary Strategies (ES) is considered (Joos and
Bartold, 2008). For each flow channel geometry, which is called an individual in the
terminology of Evolutionary Algorithms, the evaluation of the isentropic efficiency
still needs a large computational effort. The remaining question is how to reduce the
effort of the optimization procedure. One option is the use of ANNs. In this study
the knowledge gained from performed flow simulations is used to create a database
with which an ANN predicts the objective function values for subsequent impeller
geometries.

Fig. 1 Centrifugal impeller



2 Optimization of a Centrifugal Impeller Geometry 

An existing centrifugal impeller was optimized to increase the isentropic efficiency.
It is an open three-dimensional backswept impeller with 16 blades. The mass flow
is 5.8 kg s$^{-1}$, the total pressure ratio is 1:1.6 and the impeller rotates at
240 s$^{-1}$. For flow analysis some boundary conditions are necessary, so the total
temperature and total pressure at the impeller inlet are predetermined, as well as
the static pressure at the impeller outlet.

In Fig. 1 the impeller is shown with the flow channel between two adjacent 
blades. The impeller blades can be described by their camber surfaces because they 
are considered as infinitesimally thin blades. One channel is sufficient because the
optimization is done at a steady working point and so periodic boundary conditions 
can be set and only one segment of 22.5° within the impeller has to be simulated. 



3 Optimization Using Evolutionary Strategies 

During the optimization process with Evolutionary Strategies several geometries
are generated. Fig. 2 shows the variable geometry of a flow channel. The
two camber lines along the points 1-2-3 and 4-5-6 are described in the
cylindrical coordinates $(R, z)$. The z-coordinate runs along the rotational axis and
R stands for the radial coordinate.

Fig. 2 Flow channel of the impeller



Each point on the camber line can be varied by changing the z-coordinate
and by adding $\Delta\vartheta$, which describes the angular deviation from the reference
impeller. The z-coordinate at impeller inlet and outlet is kept constant. Because the
impeller has to fit into an existing housing, the radial distributions are also kept
constant. In this case $\Delta\vartheta_1$, $\Delta\vartheta_2$, $\Delta\vartheta_4$, $\Delta\vartheta_5$ as well as
$z_2$ and $z_5$ are variable and describe the different geometries.

Depending on the interpolation points characterized by $\Delta\vartheta_i$ and $z_i$, the
computational grid for the flow simulation is generated as described in Joos and
Bartold (2008). For the optimization, Evolutionary Strategies (Rechenberg, 1994;
Schwefel, 1995) were used to find the maximum of the multimodal objective
function. In this case a $(\mu + \lambda)$-ES is used. Each individual is described by a
vector of six geometrical parameters. In the initialization process a number of
individuals are produced by uniformly distributed stochastic variation of the
geometrical parameters.

In the recombination process information is exchanged between the individuals.
The mutation process is needed for exploration of the search space, which is
important for gaining new knowledge as well as for overcoming local maxima of
the objective function. After this the individuals have to be evaluated by the fully
three-dimensional Navier-Stokes solver using the $k$-$\varepsilon$ turbulence model. In
the selection process the $\mu$ individuals with the highest efficiency are selected
from the pool of $\mu$ parents and $\lambda$ children. This procedure is repeated until a
stopping criterion is fulfilled.
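
The cycle of initialization, recombination, mutation, evaluation and plus-selection
described above can be sketched as follows. This is only a minimal illustration under
stated assumptions: the function `toy_efficiency` is a hypothetical stand-in for the
CFD evaluation, and the population sizes, the mutation strength `sigma`, the parameter
bounds and the intermediate recombination are illustrative choices that are not taken
from the study.

```python
import random


def evolve(objective, bounds, mu=3, lam=6, sigma=0.1, generations=20, seed=0):
    """Maximise `objective` with a (mu + lambda) evolution strategy."""
    rng = random.Random(seed)

    def clip(x):
        # Keep every parameter inside its admissible interval.
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    # Initialization: uniformly distributed parameter vectors.
    parents = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(mu)]
    scored = [(objective(p), p) for p in parents]
    history = []
    for _ in range(generations):
        children = []
        for _ in range(lam):
            a, b = rng.sample(scored, 2)
            # Recombination: arithmetic mean of two parents (assumed variant).
            child = [(x + y) / 2.0 for x, y in zip(a[1], b[1])]
            # Mutation: Gaussian perturbation scaled to each parameter range.
            child = clip([v + rng.gauss(0.0, sigma * (hi - lo))
                          for v, (lo, hi) in zip(child, bounds)])
            children.append((objective(child), child))
        # Plus-selection: the best mu of parents and children survive.
        scored = sorted(scored + children, key=lambda s: s[0], reverse=True)[:mu]
        history.append(scored[0][0])
    return scored[0], history


def toy_efficiency(x):
    """Hypothetical objective replacing the flow simulation."""
    return 0.9 - sum((v - 0.3) ** 2 for v in x)


# Six geometrical parameters, as in the text; the bounds are invented.
best, history = evolve(toy_efficiency, [(-1.0, 1.0)] * 6)
```

Because parents compete with their children, the best efficiency in `history` can
never decrease from one generation to the next.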



4 Performance Predictions 

Evolutionary algorithms are a good choice for keeping the number of performance
evaluations small. Nevertheless the solution of the three-dimensional
Reynolds-averaged Navier-Stokes equations is still very time consuming. The present
study therefore discusses whether an ANN can replace the CFD solver for this purpose.

Fig. 3 Optimization system



4.1 Method 

The aim of this analysis is to integrate the ANN into the optimization process
(Fig. 3). In this case the ANN serves as a database in which information about the
relation between the geometry and the performance of the impeller is stored.

For the first initialization the geometry of the existing impeller is used. Several
geometries are produced with the principles of mutation and recombination. For
each one a flow simulation is performed to evaluate the objective function
value. If the stopping criterion is not fulfilled, the best individuals are selected
for a new generation and the optimization cycle starts again. If the database of the
ANN is large enough, it should be possible to replace the very time-consuming flow
simulation by the network. The evaluation by the ANN will probably not reproduce
the solutions accurately, but it should be possible to classify the quality of each new
individual. The upgraded optimization system requires an interaction between
the computational fluid dynamics (CFD) solver and the ANN as performance
predictor: the efficiencies predicted during the optimization process have
to be validated at planned intervals by CFD simulations. If the predicted
efficiencies deviate too much, the correct values are known at that point and
the new data sets can be added to the former training data set. The optimization
thus works with a dynamic database, which should allow more accurate
predictions as the optimization proceeds.
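
The interaction between predictor and solver can be illustrated with a short sketch.
Everything named here is an assumption made for illustration only: the
nearest-neighbour class is a deliberately simple stand-in for the ANN, `cfd` is any
callable that returns the exact efficiency, and the validation interval and tolerance
are invented values.

```python
class NearestNeighbourSurrogate:
    """Stand-in for the ANN: returns the efficiency of the closest known geometry."""

    def __init__(self):
        self.data = []

    def fit(self, samples):
        self.data = list(samples)

    def predict(self, x):
        nearest = min(self.data,
                      key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))
        return nearest[1]


def evaluate_population(population, cfd, surrogate, database,
                        validate_every=5, tolerance=1e-3):
    """Score individuals with the surrogate; check every k-th one against CFD."""
    scores, cfd_calls = [], 0
    for i, x in enumerate(population):
        predicted = surrogate.predict(x)
        if i % validate_every == 0:
            exact = cfd(x)              # planned validation simulation
            cfd_calls += 1
            if abs(predicted - exact) > tolerance * abs(exact):
                # Prediction deviates too much: extend the database and retrain.
                database.append((x, exact))
                surrogate.fit(database)
            predicted = exact           # the validated value is always used
        scores.append(predicted)
    return scores, cfd_calls


def toy_cfd(x):
    """Hypothetical replacement for the flow simulation."""
    return 0.9 - sum(v * v for v in x)


database = [([0.0] * 6, toy_cfd([0.0] * 6))]
surrogate = NearestNeighbourSurrogate()
surrogate.fit(database)
population = [[0.01 * i] * 6 for i in range(10)]
scores, cfd_calls = evaluate_population(population, toy_cfd, surrogate, database)
```

In this toy run only two of the ten individuals are simulated, and the second
validation enlarges the database because the prediction was too far off.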

Fig. 4 Artificial Neural Network: six input units, two hidden layers and one output unit

4.2 Experimental Settings 

A simple feed-forward ANN is used as an approximation tool. Six input units, two
hidden layers and one output unit were selected for the present approximation
problem. A training pattern, which is called an individual in the terminology of
evolutionary strategies, consists of a six-dimensional input vector u and the
one-dimensional output vector t. The geometrical parameters and
the efficiencies were transformed to the interval [0.1, 0.9] (Fig. 4).
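
The text does not state how the transformation to [0.1, 0.9] was carried out. A
linear min-max scaling is one common possibility and is assumed in the following
sketch; the efficiency values are invented for illustration.

```python
def scale(values, lo=0.1, hi=0.9):
    """Map values linearly so that min -> lo and max -> hi (assumed scheme)."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


def unscale(scaled, vmin, vmax, lo=0.1, hi=0.9):
    """Inverse mapping, needed to read network outputs as efficiencies."""
    return [vmin + (s - lo) * (vmax - vmin) / (hi - lo) for s in scaled]


efficiencies = [0.80, 0.85, 0.90]          # hypothetical values
scaled = scale(efficiencies)
restored = unscale(scaled, min(efficiencies), max(efficiencies))
```

Keeping the targets away from 0 and 1 is a common choice when the output unit has a
sigmoid activation, because the saturated ends of the sigmoid are hard to reach.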

The Resilient Backpropagation algorithm (Rprop) is used as a locally adaptive
learning scheme (Zell, Mamier, Vogt, Mache, Hübner, et al., n.d.). A weight-decay
term with the weight-decay parameter $\alpha$ determines the relationship
between two goals. The standard goal is to reduce the output error; the other is to
reduce the size of the weights in order to improve generalization. Compared with
standard backpropagation and backpropagation with a momentum term, Rprop
needs fewer training cycles to find a minimum of the error function in this test case.
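
The principle of Rprop, in which only the sign of each partial derivative is used and
every weight has its own adaptive step size, can be sketched as below. The sketch is
an assumption-laden illustration and not the simulator used in the study: it
implements the variant without weight backtracking, uses the commonly quoted default
factors 1.2 and 0.5, and adds a generic quadratic weight penalty `decay * sum(w**2)`
whose exact form in the original software is not given in the text. The quadratic
toy error and its targets are hypothetical.

```python
def sign(x):
    return (x > 0) - (x < 0)


def rprop_minimize(grad, w, steps=200, eta_plus=1.2, eta_minus=0.5,
                   delta0=0.1, delta_min=1e-6, delta_max=50.0):
    """Minimise a function from its gradient with sign-based, per-weight steps."""
    w = list(w)
    delta = [delta0] * len(w)      # individual step size of every weight
    prev = [0.0] * len(w)          # gradient of the previous iteration
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            if g[i] * prev[i] > 0:
                # Same sign as before: accelerate.
                delta[i] = min(delta[i] * eta_plus, delta_max)
            elif g[i] * prev[i] < 0:
                # Sign change: a minimum was skipped, so reduce the step
                # and suppress the update in this iteration.
                delta[i] = max(delta[i] * eta_minus, delta_min)
                g[i] = 0.0
            w[i] -= sign(g[i]) * delta[i]
            prev[i] = g[i]
    return w


def make_gradient(decay):
    """Gradient of sum((w - target)**2) + decay * sum(w**2), a toy error."""
    target = [1.0, -2.0]

    def gradient(w):
        return [2.0 * (wi - ti) + 2.0 * decay * wi for wi, ti in zip(w, target)]

    return gradient


w_plain = rprop_minimize(make_gradient(0.0), [0.0, 0.0])
w_decay = rprop_minimize(make_gradient(0.5), [0.0, 0.0])
```

With the penalty switched on, the minimum moves from the targets towards zero
(here to target / 1.5), which illustrates the trade-off between a small output error
and small weights.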

To find the best topology, 5, 10, 15 and 20 neurons per hidden layer were tested.
With 5 and 10 neurons in each of the two hidden layers the error was not reduced
adequately. The choice of 20 neurons in each layer did not lead to better
generalization.



4.3 Experimental Results 

Fig. 5 shows the convergence history of the optimization with evolutionary
strategies that has already been performed. The first 50 individuals created during
that optimization process (Joos and Bartold, 2008) were used as the training data
set for the ANN. Different network topologies with different numbers of hidden
units were used to test the ability of the ANN as a performance predictor. Fifteen
units in each hidden layer are a good choice for the given approximation
problem.

After the training procedure the next 100 individuals are used to assess the
prediction quality of the network. Fig. 6 shows that the outputs computed by
the network for the 50 training individuals are correct. In the more
interesting prediction area, different degrees of correlation between the correct
performances and the performances predicted by the network can be identified.

Fig. 5 Convergence history of optimization with evolutionary strategies

Fig. 6 Comparison of predicted efficiency (ANN) and simulation (CFD)

Because the exploration of the design space proceeds, the parameter settings of the
individuals created by the evolutionary algorithm tend to differ more and more from
the training data. The individuals 51-80, which directly follow the training set,
therefore exhibit better correlations than individuals whose geometrical parameters
differ more from the training data set. Fig. 7 shows the ratio of predicted to
evaluated efficiencies. The correlations of individuals 51 to 80 are visibly better
than those of the following ones: the efficiency ratios of these individuals are all
within the range 0.999-1.001. The subsequent individuals exhibit larger deviations.
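
The quality measure used here, the ratio of predicted to simulated efficiency and the
share of individuals inside a tolerance band around 1, can be computed as in the
following sketch. The numbers are invented for illustration and are not data from
the study.

```python
def efficiency_ratios(predicted, simulated):
    """Ratio of ANN prediction to CFD result for every individual."""
    return [p / s for p, s in zip(predicted, simulated)]


def share_within(ratios, band=0.001):
    """Fraction of ratios that lie in the interval [1 - band, 1 + band]."""
    inside = [abs(r - 1.0) <= band for r in ratios]
    return sum(inside) / len(inside)


predicted = [0.9005, 0.8990, 0.9100]   # hypothetical ANN outputs
simulated = [0.9000, 0.9000, 0.9000]   # hypothetical CFD results
share = share_within(efficiency_ratios(predicted, simulated))
```

Only the first of the three invented predictions lies inside the band of 0.999 to
1.001, so `share` is one third.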






Fig. 7 Ratio of predicted efficiencies to evaluated efficiencies ($\eta_{ANN}/\eta_{CFD}$ plotted against individual)




Fig. 8 Correlation of prediction (ANN) and simulation (CFD) (axis label: Efficiency (evaluated by CFD); legend: Prediction, Training Set)



For the 50 individuals that were presented to the network during the
training phase, the predicted and simulated efficiencies shown
in Fig. 8 agree very closely. The individuals that were not presented to the network
(the following 100 individuals) are marked by the rhombi in Fig. 8. Sixty-five
percent of these predictions deviate by less than 0.05% from the correct values.

To find options for increasing the accuracy of the predictor, different values of the
sum of squared errors (SSE) were defined as the lower bound at which training stops.
Figs. 9 and 10 compare the prediction performance of two networks trained with
different SSE bounds.

Fig. 9 Differences between prediction by ANN and CFD-Simulation, SSE < 0.01

Fig. 10 Differences between prediction by ANN and CFD-Simulation, SSE < 0.0001

In the training area of both charts the outputs computed by the network are very
accurate because these data sets were already presented to the network during
training, although with the SSE bound set to 0.01 there are some small prediction
errors in the range of individuals 20-30 (Fig. 9). This is not the
achievement we are interested in, however. Instead the network should be able to
generalize well to new input vectors. Comparing the prediction areas in Figs. 9 and
10 shows that the error is smaller for the larger SSE bound. This phenomenon is
called overtraining; it can be avoided by using test and validation sets to find the
right parameter settings for the learning process.
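
One common way of using a validation set against overtraining is early stopping:
training ends when the validation error has not improved for a number of epochs,
instead of running until a fixed SSE bound on the training set is reached. The
sketch below assumes this technique, which the text does not prescribe; the two
callables and the synthetic validation curve are hypothetical.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_error, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()                  # one pass over the training patterns
        error = validation_error()         # error on patterns not used in training
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break                          # validation error keeps rising
    return best_epoch, best_error


# Synthetic validation curve with its minimum at epoch 30 (invented numbers).
curve = iter([(e - 30) ** 2 / 900 + 0.01 for e in range(1, 101)])
best_epoch, best_error = train_with_early_stopping(lambda: None,
                                                   lambda: next(curve))
```

In a real training run the weights belonging to `best_epoch` would be stored and
restored after the loop ends.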

Fig. 9 makes it obvious that an ANN is not able to make good predictions in
unknown areas of the design space. The error increases the further the evolutionary
process leads the parameter settings into unknown regions of the design space. The
more the parameter settings differ from the training parameter set, the worse
the generalization quality of the network.



5 Conclusion

The use of ANNs to approximate the isentropic efficiencies of newly generated
geometries of a centrifugal impeller has been presented. The more similar the
parameter settings of the new geometries are to the training pattern set, the more
accurate is the generalization. The performance predictor should therefore only be
used for parameter settings that are close, that is similar, to those of the training
set. For this reason, after new individuals have been created during the evolutionary
process, some validation simulations have to be performed, on the one hand to check
the prediction quality of the network and on the other hand to enlarge the database
in either case. It is also necessary to work with test and validation
sets during the training procedure of the neural network in order to find the right
parameter settings and to avoid overtraining.



Efficient Media Exploitation Towards
Collective Intelligence

Sam Chapman, Fabio Ciravegna, Steffen Staab and Pavel Smrz, Yiannis
Kompatsiaris, and Yannis Avrithis



Abstract In this work we propose intelligent, automated content analysis tech- 
niques for different media to extract knowledge from the multimedia content. 
Information derived from different sources/modalities will be analyzed and fused, in 
terms of spatiotemporal, personal and even social contextual information. In order 
to achieve this goal, semantic analysis will be applied to the content items, taking 
into account the content itself (e.g., text, images and video), as well as existing per- 
sonal, social and contextual information (e.g., semantic and machine-processable 
metadata and tags). The above process exploits the so-called “Media Intelligence” 
towards the ultimate goal of identifying “Collective Intelligence”, emerging from 
the collaboration and competition among people, empowering innovative services 
and user interactions. The utilization of “Media Intelligence” constitutes a depar- 
ture from traditional methods for information sharing, since semantic multimedia 
analysis has to fuse information from both the content itself and the social context, 
while at the same time the social dynamics have to be taken into account. Such 
intelligence provides added-value to the available multimedia content and renders 
existing procedures and research efforts more efficient. 

Keywords Collective intelligence • Media intelligence • Social media. 



1 Introduction 

Nowadays most community-related knowledge and information originates from raw
content, be it in the form of text, images, video or speech. Human annotation or
tagging used in social networks is a way to represent or handle the underlying
knowledge, yet despite the human intervention, content remains highly unstructured
and it is quite difficult to extract semantics and correlate them with other sources
of information. The term “Media Intelligence” is introduced in this work and aims at
the development of intelligent, automated content analysis techniques for different
media to extract knowledge from the content itself. It contributes to the current
state-of-the-art techniques in single-modality content processing and at the same
time makes a significant step in proposing novel research methods for fusing
information from different sources/modalities, contextual information (e.g., time,
location, acquisition metadata), personal context (e.g., profile or
preferences) and social context (tagging, ratings, group profiles, relevant content
collections, etc.).



2 Progress Over Related Scientific Work 

The main drawback of current multimedia content is the fact that it remains highly 
unstructured and it is quite difficult to extract semantics from it and correlate them to 
other sources of information. Consequently, we may identify a number of principal 
challenges to tackle within our work: 

• Semantic gap. Although description of multimedia information has seen sig- 
nificant progress, the pace of automatic extraction of such a description, and 
especially of its semantic part, is rather slow, due to the limitations of state of 
the art multimedia analysis systems. A “semantic gap” has been acknowledged 
between current multimedia analysis methods and tools on the one hand and 
semantic description and annotation methods and tools on the other (Smeulders, 
Worring, Santini, Gupta, & Jain, 2000). 

• Scope/domain generalization. Due to the above challenges, multimedia content 
analysis techniques are mostly being developed and tuned to a narrow application 
scope and are not extendible to other application contexts (Haralick & Shapiro, 
1993). In some cases, the methods do not even work on a test set covering 
multiple aspects of one and the same target application scope. 

• Heterogeneity of modalities. Multimodal processing (Maragos, 2004) and the use 
of contextual information (Boutell, 2006) have been recognized as key in deal- 
ing with the above challenges. However, text, image, video, speech processing 
and analysis are so diverse in nature that experts in different disciplines often 
have difficulties in reaching a common methodology when dealing with multiple 
modalities at the same time. 

• Unstructured data. The Web has emerged as a massive source of multimedia 
content and automated methods have appeared that collect and organize such 
content into ground truth data for research, vocabularies and so on (Popescu, 
Grefenstette, Millet, Moellic, & Hede, 2006). On the other hand, information 
extraction over such sparse, distributed, unstructured sources, integrating infor- 
mation from content with information from metadata and Web 2.0 tags as well is 
another challenge on its own. 

Going a step further into detail and regarding text analysis, a number of
challenges exist when dealing with documents originating from user input and existing
sources:

1. Ability to adapt to large-scale tasks and domains using limited user input; current
methods designed for large-scale mining are not applicable to the task, as they
require redundancy of information (Ciravegna, Chapman, Dingli, & Wilks, 2004;
Etzioni, Cafarella, Downey, Kok, Popescu, et al., 2004) that is not present in
many cases (e.g., in an “Emergency Response” case) where information may be
scarce and not repeated.

2. Exploitation of contextual information to limit the complexity of extraction, for 
example to help disambiguate information; in this work we will go beyond the 
current methods in that we will extend the concept of contextual information 
beyond the use of simple lists or gazetteers (as in Stevenson & Greenwood, 2006) 
or user interaction (Chakrabarti, Puniyani, & Das, 2006). The use of contextual 
information will be considered across media (where information in one medium 
helps disambiguating information in another medium) or by reusing existing 
meta-information about documents (creator, time, etc.) or the information looked 
for (background information). 

In this work we shall focus text analysis on user input coming from messages 
(e.g., multimedia messages, emails) and on the analysis of existing documents (web 
pages, blogs, RSS feeds and material at the user sites, e.g., PowerPoint presenta- 
tions, HTML documents). Technologies suitable for large scale processing of text 
in knowledge management environments (e.g., Iria & Ciravegna, 2006; Ciravegna 
& Lavelli, 2004) will be enhanced and adapted to the social web requirements of 
the current era. 

On the other hand, regarding image and video analysis, state-of-the-art tech- 
niques are frequently used for content-based retrieval by use of low-level features 
(Haralick & Shapiro, 1993; Haas, Lew, & Huijsmans, 1997; Foote, 1999) in con- 
junction with high-level human understandable concepts (Al-Khatib, Day, Ghafoor, 
& Berra, 2005) for multimodal analysis. For instance, Blinkx (http://www.blinkx.com)
is a web platform that enables multimodal video retrieval based on embedded 
metadata, audio and video cues. Our work will extend such technologies and 
exploit state of the art techniques in single modality content processing and will 
make a significant step in researching novel methods of fusing information from 
diverse modalities (Maragos, 2004), contextual information (Boutell, 2006), per- 
sonal context (Vallet, Castells, Fernandez, Mylonas, & Avrithis, 2007) and social 
context. 

It will also utilize existing multimedia analysis approaches based on explicit 
knowledge that model visual features/structure, algorithms, domain concepts and 
rules guiding the analysis process (Dasiopoulou, Mezaris, Kompatsiaris, Papastathis, 
& Strintzis, 2005) and will advance the state of the art by providing intelligent exten- 
sions to knowledge-assisted analysis (Papadopoulos, Mylonas, Mezaris, Avrithis, 
& Kompatsiaris, 2006; Athanasiadis, Mylonas, Avrithis, & Kollias, 2007) based 
on visual and complementary contextual information (Mylonas & Avrithis, 2007) 
derived from other modalities, metadata, personal and social context. It will extend
them to better adapt to and benefit from the remarkable success of Web 2.0
applications by:

1. Formalisation of multimedia analysis based on user context (e.g., physical loca- 
tion, media acquisition conditions). 

2. Fusion of knowledge originating from different modalities and different analysis 
processes. 

3. Exploitation of the intelligence emerging from user groups in order to boost 
the performance of multimedia analysis and support novel content access and 
delivery mechanisms. 

4. Adaptation and combination of vocabulary-based telephone speech recogni- 
tion techniques and utilization of phonetic search that enables identification of 
proper names and usually unrecognized words, especially in the context of an 
emergency response case study. 

Regarding speech analysis, today’s standard vocabulary-based speech recogni- 
tion technology cannot provide sufficient accuracy for such cases; the word-error 
rate can be as high as 40% for noisy environments. Moreover, the tools can only rec- 
ognize a given set of words (from their limited vocabularies). They cannot deal with 
new names of persons, places, etc., that are crucial for real-life use cases. To address 
this situation, we will combine a vocabulary-based speech recognizer with a key- 
word spotting module implementing the functionality of the phonetic search, also 
supporting detection of OOV (out-of-vocabulary) words. The phonetic recognition 
can be more or less language independent but can also benefit from an identification 
of the language-specific set of phonemes. 

The proposed framework will provide a unique opportunity in exploiting 
challenging research directions such as using multimodal processing, contextual 
information, personal and social context, tags and other information to improve 
the performance of existing multimedia analysis methods. Most importantly, such 
analysis is not currently used in social network environments mainly due to its unre- 
liability. The greatest challenge will be to demonstrate that combined with metadata, 
tags and other information currently used, content analysis can provide a more pow- 
erful experience for the user. Success will highlight the value of content analysis 
in such environments, generate awareness and open the way to a number of future 
applications. 



3 Intelligent Media Analysis 

The main step towards efficient “Media Intelligence” concerns automated analysis 
and semantic extraction from raw visual, textual or audio content and associated 
metadata. Analysis focuses on each medium in isolation and without taking into 
account any contextual information or the social environment. However, it does take 
into account prior knowledge, either implicit, in the form of supervised learning 
from training data, or explicit, in the form of knowledge driven approaches. 

Extracting knowledge from raw data is a huge research problem on its own, so
work in this field is expected to advance the existing state-of-the-art techniques
for each medium, while a significant effort will be devoted to:

1. Adapting to the individual domains of interest and intelligence methodologies

2. Handling heterogeneity of unstructured user-contributed content 

3. Supporting interoperability with contextual information 



3.1 Text Analysis 

Textual information is of fundamental importance in every scenario where humans 
are involved as it is the most common medium of communication. Unfortunately 
the ability humans have to understand and generate language is not matched by 
machines and, therefore, information in textual documents is generally unavailable 
for automatic processing. Textual information is pervasive and - in the digital era - 
its availability is increasing. Intelligent techniques are required to enable automatic 
Information Extraction (IE) from text and make this information available for fur- 
ther processing, e.g., to integrate with other sources or to proactively execute tasks 
(Fig. 1). 

The current research challenges for text analysis are: 

1. Information extraction over sparse distributed documents (e.g., the Web), inte- 
grating information from documents with information from metadata and Web 
2.0 tags as well. Technologies for large scale processing of text in knowledge 
management environments (e.g., Iria & Ciravegna, 2006; Ciravegna & Lavelli, 
2004) will be enhanced and adapted to the current social web requirements. 

2. Analysis of time and spatial information: Modelling information with a strong
spatial and temporal connotation over a large scale is a complex and challenging
problem and - to our knowledge - has been only partially addressed previously.

3. User profiling and monitoring as a way to empower and direct text extraction,
i.e., strategies for focusing extraction and making sense of facts based on user and
task profiling or, in other words, taking personal context into account for
multimodal media analysis.

Fig. 1 Intelligent text analysis paradigm



3.2 Visual Information Analysis 

Visual information, that is, still images and especially video, tends to impose huge
requirements on current repositories or social network sites in terms of storage 
or transmission due to the size of the data involved, yet its contribution to the 
knowledge and intelligence of related applications remains insignificant. Research 
in disciplines like image processing, pattern recognition and computer vision has 
been ongoing for decades but satisfactory performance can usually only be achieved 
in constrained domains, scales and environments (Fig. 2). 

Our work leverages existing tools to handle well-specified problems like
spatial/temporal decomposition and structuring, object/event detection, recognition
and tracking, investigating appropriate visual features, metrics and supervised
learning approaches using annotated training data. It also expands existing
approaches based on explicit knowledge that models visual features/structure,
algorithms, domain concepts and rules guiding the analysis process (Dasiopoulou,
Mezaris, Kompatsiaris, Papastathis, & Strintzis, 2005). In doing so, it employs
state-of-the-art knowledge-assisted analysis tools and advances them by providing
extensions to:

1. Support processing of incomplete, uncertain, partial or conflicting information

2. Exploit visual context emerging from interaction between local and global
processing

3. Support use of complementary contextual information derived from other
modalities, metadata, user and social context

Fig. 2 Intelligent visual analysis paradigm



3.3 Speech Analysis 

Speech is considered to be an important modality, e.g., in the case of emergen- 
cies, where due to the number of available resources, handling by humans is clearly 
inadequate and automated extraction of semantics is necessary. In this manner, 
a vocabulary-based speech recognizer will be combined with a keyword spotting 
module implementing the functionality of the phonetic search. The phonetic recog- 
nition can be more or less language independent (but can also benefit from an 
identification of the language-specific set of phonemes). The research agenda will 
focus on the detection of OOV (out-of-vocabulary) words in the output of the large 
vocabulary speech recognition module and on the advantages the phonetic recog- 
nizer can provide for their search. The development part of the task will concentrate 
on the generality of the designed interfaces to enable easy updates of the background 
recognition modules. 

In principle, response to emergency cases will benefit from speaker identification 
technology (and verification of the speaker identity). State of the art speaker iden- 
tification technologies will be examined and employed; the latter are mainly based 
on advanced feature extraction and machine learning techniques. Requirements for 
speech analysis related to visual and text analysis will also be investigated. 



4 Contextual Media Analysis and Fusion 

The aim of this section is to combine the semantics extracted from different modal- 
ities in a structured way, along with contextual information like time, location, 
or acquisition metadata, and personal context like user profile and semantic pref- 
erences. Fusion of heterogeneous information derived from different sources and 
modalities is a problem that has been long studied but only partially dealt with 
(e.g., still images with associated metadata, or video sequences with associated 
audio), mainly due to its multidisciplinary nature and the different requirements of 
each study. An integrated theoretical model will be developed in this task to handle 
fusion of textual, visual and speech semantics, coupled with contextual and personal 
information. Contextual information typically refers to metadata automatically cre- 
ated and stored with the content, like acquisition time, location (mobile cell, GPS 
coordinates) and parameters, e.g., device or camera metadata (aperture, focal length,
lighting conditions, flash use) for the still image case (Boutell, 2006). Such infor- 
mation may be available with the content itself (e.g., EXIF metadata for images) 
or in separate resources (e.g., MPEG-7 metadata for video). Such metadata can be 
valuable when combined in the analysis process and will be exploited to disam- 
biguate, resolve inconsistencies or complement missing information in simple tasks 
like indoor/outdoor classification. 

On the other hand, personal context (Mylonas & Avrithis, 2007), like the profile 
and the personal preferences of the user contributing a specific resource, provides 
additional evidence to assist content-based analysis of the resource, e.g., types of 
places one visits, style of writing and so on. The contextual extraction mechanisms
that are necessary for the creation and exploitation of personal context will be
investigated. All evidence will be taken into account both during multimodal media
analysis (early fusion) and as a post-analysis integration step (late fusion). To
support the required information fusion processes, existing knowledge representation
formalisms and reasoning tools will be extended to support temporal/spatial analysis,
media interpretation and information fusion under uncertainty, and incomplete or
contradicting evidence.
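
The difference between the two fusion points can be illustrated with a small sketch
for the indoor/outdoor example mentioned above. It is purely illustrative: the
modality names, feature values, confidences and weights are invented, and a weighted
average is only one of many possible late-fusion rules.

```python
def early_fusion(features):
    """Early fusion: concatenate per-modality features before any analysis."""
    fused = []
    for modality in sorted(features):      # fixed order keeps vectors comparable
        fused.extend(features[modality])
    return fused


def late_fusion(scores, weights):
    """Late fusion: weighted average of per-modality confidences for one label."""
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * s for m, s in scores.items()) / total


# Hypothetical feature vectors from two modalities.
vector = early_fusion({"visual": [0.2, 0.4], "exif": [1.0]})

# Hypothetical confidences that an image shows an outdoor scene.
outdoor_scores = {"visual": 0.55, "exif": 0.90, "tags": 0.70}
reliability = {"visual": 1.0, "exif": 2.0, "tags": 1.0}
outdoor = late_fusion(outdoor_scores, reliability)
```

Because `late_fusion` only sums over the modalities that are present in `scores`, a
missing source (for example an image without tags) simply drops out of the average.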



5 Social Media Intelligence 

It is expected that actionable knowledge can be extracted by analysing how multi- 
media content is shared, accessed, annotated and otherwise used by communities. 
The aim of this task is to exploit workspace statistics (e.g., access/usage history) 
and social content (tags, ratings, related content items, related users, group profiles, 
etc.) in order to improve the intelligent media analysis and use. This includes, e.g., 
analysing how images are stored or how frequently they are accessed in the context 
of a given task in order to learn what they have in common in terms of content. 

Knowledge about how users are interacting in a shared community and semantics 
of social interaction extracted from system usage will be investigated in this task. 
This kind of information will be exploited in the content analysis process; such an 
approach has not been explored yet, to our knowledge. This would permit, e.g., to 
analyze an image given the profile of the contributing user, other pictures in the same 
collection (e.g., from vacation), comments from his/her friends, or related pictures 
within the community. This is yet another information fusion task that will be carried 
out, where contextual information extends to include personal or social context. 
Collaborative computing techniques will be employed to enable collaborative media 
tagging, manipulation, or search over a social network. 

Knowledge extracted from the social content will be represented in a machine 
understandable way in order to achieve interoperability and knowledge sharing. 
Semantic Web technologies will provide the formal framework for the represen- 
tation and processing of syntax and semantics of the extracted social knowledge. 
Existing reasoning engines will be investigated and used to process the extracted
knowledge. Reasoning services are essential to check the consistency and the
validity of instances, to extract implicit knowledge using subsumption and equivalence
relations defined in the ontologies and, finally, to take decisions using inference
rules. Existing work and standards will be examined in order to construct the social
content ontology. For instance, Friend-Of-A-Friend (FOAF) is a simple technology
that makes it easier to share and use information about people and their activities
(e.g., photos, calendars, weblogs), to transfer information between Web sites, and
to automatically extend, merge and re-use it online. FOAF can be used as a starting
point and extended where needed in order to support representation and exchange
of knowledge with respect to content, metadata, users and community interactions.



6 Conclusions 

In this work we presented our initial research efforts and ideas in designing and 
implementing efficient, automated content analysis techniques for different media 
to extract knowledge from the multimedia content. Contextual information, in terms 
of spatial, temporal, semantic, personal and social information, constitutes the added 
value over current traditional media analysis approaches. Applying and exploiting this 
kind of additional information paves the way to efficiently advance the so-called 
“Media Intelligence” towards the ultimate goal of identifying “Collective Intelligence”, 
emerging from the collaboration among large communities and empowering 
innovative services and people interactions. Such combined intelligence provides 
added value to the available multimedia content and results in a more efficient 
rendering of procedures and research. 



Multi-class Extension of Verifiable Ensemble 
Models for Safety-Related Applications 



Sebastian Nusser, Clemens Otte, and Werner Hauptmann 



Abstract This contribution discusses two different strategies for extending a veri- 
fiable ensemble approach for binary classification tasks to also solve multi-class 
problems. The binary ensemble approach was developed with the objective of 
providing interpretable classification models for use in safety-related application 
domains. It is based on low-dimensional submodels. Each submodel uses only a 
low-dimensional subspace of the complete input space facilitating the visual inter- 
pretation and validation by domain experts. Thus, the correct inter- and extrapolation 
behavior can be guaranteed. The extension to multi-class problems is not straightfor- 
ward because common multi-class extensions might induce inconsistent decisions. 
The proposed approaches avoid such inconsistencies by introducing a hierarchy of 
misclassification costs. We will show that by following such a hierarchy the exten- 
sion of the binary ensemble becomes feasible and the desirable properties of the 
binary classification approach for safety-related problems can be maintained. 

Keywords Ensemble learning · Interpretability · Multi-class · Safety-related. 



1 Introduction 

Safety-related systems are systems whose malfunction or failure may lead to death 
or serious injury of people, loss or severe damage of equipment, or environmental 
harm. They are deployed, for instance, in aviation, the automotive industry, medical 
systems and process control. Usually, it is not possible to rectify a wrong decision in 
these application domains. Thus, a classification task is often modeled as an imbalanced 
problem: the default state of a safety-related system should only be changed if 
there is strong evidence for an incident or a malfunction. On the other hand, discovering 
such an incident is important in order to avoid perilous consequences. To apply 
a machine learning method in this domain one must ensure that the solution is sensitive 
enough to discover failures but also robust enough to avoid false alarms. It is 
important to convince the domain experts that the learned solution solves the right 
problem (that is, the solution must be correct according to physical laws and must 
use proper system assumptions) and complies with the functional specifications. In 
practical application tasks, the available training data is often too sparse and the 
number of input dimensions is too large to apply statistical risk estimation methods 
reliably. In most cases, high-dimensional models are needed to solve a given problem. 
Unfortunately, such high-dimensional models are hard to verify (curse of dimensionality), 
tend to overfit, and their interpolation and extrapolation behavior is often unclear 
or opaque. An example of such counterintuitive and unintended behavior is illustrated 
in Fig. 1, where the prediction of the model changes in a region not covered by the 
given data set. Such behavior is even more likely and much more difficult to discover 
in the high-dimensional case. Thus, machine learning methods - especially “black-box” 
approaches like artificial neural networks or support vector machines - are regarded 
with suspicion by engineers dealing with safety-related application problems. 

Fig. 1 Counterintuitive extrapolation behavior in a region not covered by the data set. This two- 
class problem is solved by a support vector machine (SVM) with an acceptable classification 
performance on the given data. However, in a region not covered by any data the decision of 
the SVM changes arbitrarily 

In a previous work (Nusser, Otte, and Hauptmann, 2007) we developed a sequen- 
tial covering algorithm for binary classification problems in safety-related domains. 
It is based on ensembles of low-dimensional submodels, where each submodel as 
well as the overall ensemble model can be verified and, thus, the correct interpo- 
lation and extrapolation behavior of the complete model can be guaranteed. In the 
present contribution we will extend this approach to multi-class problems. 

The paper is organized as follows: Sect. 2 briefly describes our binary ensemble 
framework for safety-related domains. We point out problems of commonly 
used multi-class extensions in Sect. 3. In Sect. 4 our ensemble framework is extended 
to also solve multi-class problems, and Sect. 5 concludes. 



2 The Verifiable Ensemble 

Our ensemble framework (Nusser, Otte, and Hauptmann, 2007) is designed to 
solve a binary classification problem, that is, we want to determine an estimate 
of the unknown function $f : V^n \to Y$, where $V^n = X_1 \times X_2 \times \dots \times X_n = 
\prod_{i=1}^{n} X_i$ with $X_i \subseteq \mathbb{R}$ and $Y = \{0, 1\}$, given the observed data set 
$\mathcal{D} = \{(\vec{v}_1, y_1), (\vec{v}_2, y_2), \dots, (\vec{v}_m, y_m)\} \subset V^n \times Y$. 
Our classification approach is motivated by Generalized Additive Models (Hastie & 
Tibshirani, 1990) and Separate-and-Conquer approaches (Fürnkranz, 1999). It can be 
interpreted as a variant of projection pursuit (Friedman & Tukey, 1974; Huber, 1985). 
The extension of our approach to multi-class problems is discussed in Sect. 4. 

Basic Idea 

We are interested in determining an estimate of the unknown function $f : V^n \to Y$. 
Instead of regarding the original high-dimensional input space $V^n$, we regard 
only two- or three-dimensional subspaces of $V^n$. Submodels are trained on these 
small subspaces. This facilitates the visual interpretation of the solution and, thus, 
unintended inter- and extrapolation behavior can be avoided. The submodels are 
combined into an ensemble of models to overcome the limited predictive performance 
of each single model. That is, our approach assumes that the original classification 
problem is separable into low-dimensional subproblems. 

Projections of High-Dimensional Data 

The projection $\pi_\beta$ maps the high-dimensional input space $V^n$ to an arbitrary 
subspace $V^\beta$. This mapping is determined by a given index set $\beta \subset \{1, \dots, n\}$. 
This index set defines the dimensions of $V^n$ that will be included in the subspace 
$V^\beta$. Thus, the projection $\pi_\beta$ can be defined as 

$$\pi_\beta(\vec{v}) : V^n \to V^\beta = \prod_{i \in \beta} X_i . \qquad (1)$$

Submodels 

The $j$-th submodel is defined as 

$$g_j : \pi_{\beta_j}(V^n) \to Y , \qquad (2)$$

where $\beta_j$ denotes the index set of the subspace where the classification error of 
the submodel $g_j$ is minimal. In order to determine the best projections, a wrapper 
method for feature selection (Kohavi & John, 1997) that performs an exhaustive 
search through all pairwise input combinations is used. For very high-dimensional 
data sets we advise using a preceding feature selection (e.g., Guyon & Elisseeff, 
2006) to reduce the computational costs. The final function estimate $f(\vec{v})$ of 
the global model is determined by the aggregation of the results of all submodels. 

Algorithm 1 Learning a Verifiable Ensemble 

input: data set $\mathcal{D}$, $c_{pref}$ - label of default class, $dim_{limit}$ - limit of dimensions (fixed) 
output: models - set of submodels 
function models := buildModel($\mathcal{D}$, $c_{pref}$) 
    solve $\forall (\vec{v}, y) \in \mathcal{D} : \min |y - g(\pi_\beta(\vec{v}))|$, $\beta \subset \{1, \dots, n\}$ 
        s.t. $|\beta| = dim_{limit}$ and $\forall y = c_{pref} : |y - g(\pi_\beta(\vec{v}))| = 0$ 
    $\mathcal{D}_{new} := \{(\vec{v}, y) \mid g(\pi_\beta(\vec{v})) = c_{pref}\}$ 
    if ($\mathcal{D} \setminus \mathcal{D}_{new} \neq \emptyset$) then 
        models := $\{g(\pi_\beta(\cdot))\} \cup$ buildModel($\mathcal{D}_{new}$, $c_{pref}$) 
    else 
        models := $\emptyset$ 
    end if 



Learning the Verifiable Ensemble 

Our ensemble requires that the default class ($c_{pref}$), that is, the default state of the 
safety-related system, must not be misclassified by any of the trained submodels: 
$\forall y = c_{pref} : |y - g_j(\pi_{\beta_j}(\vec{v}))| = 0$. Algorithm 1 shows the learning 
algorithm of our ensemble method. For the sake of simplicity, the default class is 
always encoded as zero. 



Applying the Verifiable Ensemble 

The application of our ensemble framework is very simple. Due to the restriction 
that no submodel may misclassify the default class, it is sufficient to 
determine the maximum of the output of all submodels: 

$$f(\vec{v}) = \max_{(g_j, \beta_j) \in models} g_j(\pi_{\beta_j}(\vec{v})) , \qquad (3)$$

where models is the set of submodels that is returned by Algorithm 1. 
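
The learning and application steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: instead of an SVM, each submodel is a hypothetical stand-in (a disc around the default-class samples in the projected subspace), chosen because it trivially satisfies the constraint that no default-class sample is misclassified. The names `project`, `fit_submodel`, `build_model` and `predict` as well as the toy data are illustrative.

```python
from itertools import combinations
from math import dist

def project(v, beta):
    """pi_beta: keep only the coordinates listed in the index set beta."""
    return tuple(v[i] for i in beta)

def fit_submodel(data, beta):
    """Hypothetical stand-in for an SVM submodel: a disc around the
    default-class samples (label 0). Inside the disc -> 0, outside -> 1,
    so no default-class training sample is ever misclassified."""
    zeros = [project(v, beta) for v, y in data if y == 0]
    center = tuple(sum(c) / len(zeros) for c in zip(*zeros))
    radius = max(dist(p, center) for p in zeros)
    return lambda p: 1 if dist(p, center) > radius else 0

def build_model(data, dim_limit=2):
    """Recursive sequential covering in the spirit of Algorithm 1."""
    if {y for _, y in data} != {0, 1}:
        return []                       # assumption: stop if one class is absent
    n = len(data[0][0])
    best = None
    for beta in combinations(range(n), dim_limit):   # exhaustive search
        g = fit_submodel(data, beta)
        errors = sum(g(project(v, beta)) != y for v, y in data)
        if best is None or errors < best[0]:
            best = (errors, g, beta)
    _, g, beta = best
    # D_new: samples that the chosen submodel predicts as the default class
    d_new = [(v, y) for v, y in data if g(project(v, beta)) == 0]
    if len(d_new) == len(data):         # D \ D_new is empty: nothing captured
        return []
    return [(g, beta)] + build_model(d_new, dim_limit)

def predict(models, v):
    """Eq. (3): maximum over the outputs of all submodels."""
    return max((g(project(v, beta)) for g, beta in models), default=0)

# Toy data: default class around the origin, Class 1 shifted along each axis.
data = [((0, 0, 0), 0), ((0.1, 0, 0), 0), ((0, 0.1, 0), 0),
        ((0, 0, 0.1), 0), ((-0.1, -0.1, -0.1), 0),
        ((1, 0, 0), 1), ((1.2, 0, 0), 1), ((0, 1, 0), 1),
        ((0, 1.2, 0), 1), ((0, 0, 1), 1), ((0, 0, 1.2), 1)]
models = build_model(data)
```

On this toy data the first submodel uses the first two dimensions and misses the Class 1 samples shifted along the third axis; the second submodel, trained only on the samples still predicted as default, picks them up in another projection.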



An Illustrative Example 

The Cubes data set is generated from four Gaussian components in a three-dimensional 
space. For each Class 1 cluster 50 samples are drawn from $\mathcal{N}(e_i, 0.2 \times I)$, 
where $e_i$ is a unit vector and $I$ is the identity matrix. One hundred samples of the 
Class 0 cluster are scattered around the origin, drawn from $\mathcal{N}((0, 0, 0)^T, 0.2 \times I)$. 
All submodels are trained as SVMs with Gaussian kernel and the parameter set 
$\gamma = 0.2$ and $C = 5$. Class 0 is chosen as the default class, $c_{pref} = 0$. That is, 
Class 0 must not be misclassified by any learned submodel. Within this example, 
imbalanced misclassification costs (10:1) are used in order to avoid the misclassification 
of the default class. The ratio of misclassification costs is usually problem-specific. 
Hence, it ought to be determined according to the given domain knowledge or 
by an experimental evaluation. At the initial state, all two-dimensional projections 
of the Cubes data set are very similar. The best submodel $g_1$, see Fig. 2a, uses 
the projection $\pi_{\beta_1}(\vec{v})$ with $\beta_1 = \{1, 2\}$. Fifty-three data points from Class 1 
are misclassified by this submodel. Thus, in the next iteration new submodels 
are trained only on samples that are predicted as Class 0 by the first submodel: 
$\mathcal{D}_{new} = \{(\vec{v}, y) \mid g_1(\pi_{\beta_1}(\vec{v})) = 0\}$. In Fig. 2b the projection with 
$\beta_2 = \{2, 3\}$ of the data set $\mathcal{D}_{new}$ and the corresponding submodel $g_2$ are shown. 
This submodel misclassifies four Class 1 samples. Given the chosen parameter 
set, no further improvements are possible. The final predictive model is $f(\vec{v}) = 
\max\{g_1(\pi_{\beta_1}(\vec{v})), g_2(\pi_{\beta_2}(\vec{v}))\}$. The overall performance of the Verifiable 
Ensemble is shown in Fig. 2c: avoiding the misclassification of the default class 
$c_{pref} = 0$ leads to four misclassified Class 1 samples. 

                 predicted class 
true class       Class 0    Class 1 
Class 0            100          0 
Class 1              4        146 

Fig. 2 Verifiable Ensemble and the Cubes data set: (a) first submodel, (b) second submodel, 
(c) confusion matrix of the ensemble. Class 1 samples are marked with circles 
and Class 0 samples are marked with crosses. The decision boundaries are drawn as solid lines 
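
A data set of this shape can be generated as follows. This is a minimal sketch using only the Python standard library; it assumes that $0.2 \times I$ denotes the covariance matrix (so each coordinate has standard deviation $\sqrt{0.2}$), and the function name `make_cubes` and the seed are illustrative choices, not taken from the paper.

```python
import random
from math import sqrt

def make_cubes(seed=0, var=0.2):
    """Draw the binary Cubes data: 100 Class 0 samples around the origin and
    50 Class 1 samples around each of the three unit vectors."""
    rng = random.Random(seed)
    std = sqrt(var)            # assumption: 0.2 x I is a covariance matrix

    def cluster(mean, size, label):
        return [(tuple(rng.gauss(m, std) for m in mean), label)
                for _ in range(size)]

    data = cluster((0.0, 0.0, 0.0), 100, 0)           # default class
    for i in range(3):                                # one cluster per e_i
        e_i = tuple(1.0 if k == i else 0.0 for k in range(3))
        data += cluster(e_i, 50, 1)
    return data

cubes = make_cubes()
```

The class sizes (100 versus 150) match the row sums of the confusion matrix in Fig. 2c.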



3 Common Multi-class Extensions 

There are two commonly used approaches to extend binary classifiers to solve multi- 
class problems: (1) a one-against-one extension and (2) a one-against-rest extension. 
A comparison of these methods and an experimental evaluation for support vector 
machines is given in Hsu & Lin (2002). Figure 3 illustrates both approaches. 

Fig. 3 Illustration of commonly used multi-class extensions of binary classifiers: There are three 
classes: A, B, C. The discriminant functions are given as solid lines. Regions with possible 
inconsistent decisions are labeled with question marks 



One-Against-Rest Multi-class Extension 

This method constructs $k = |\mathbb{K}|$ classifiers, where $\mathbb{K} = \{1, \dots, k\}$ is the set of 
classes. The model $f_{c_\ell}$ for class $c_\ell \in \mathbb{K}$ is trained on all samples of class $c_\ell$ 
against all samples from the remaining classes, which are combined into a new class 
$\bar{c}_\ell = \mathbb{K} \setminus c_\ell$; for the sake of simplicity the class label of $\bar{c}_\ell$ is set to $-1$. 
A new data point $\vec{v}$ is assigned according to 

$$f(\vec{v}) = \arg\max_{c_\ell \in \mathbb{K}} f_{c_\ell}(\vec{v}) .$$



One-Against-One Multi-class Extension 

This method builds $k(k - 1)/2$ classifiers, one for each pairwise combination of the 
classes $c_\ell, c_{\ell'} \in \mathbb{K}$, $\ell \neq \ell'$. The final classification is performed by majority 
voting - that is, the most frequently predicted class label is returned as the prediction 
of the multi-class model. 



Risk of Inconsistent Decisions 

The issue of inconsistent decisions when combining binary classifiers into multi-class 
classifiers is addressed, for instance, in Tax & Duin (2002). As illustrated in Fig. 3, 
there can be regions of the input space where the decision of the multi-class models 
might be inconsistent. Those regions are marked with question marks in each figure. 
For the one-against-rest method, there are two kinds of inconsistent decisions: 
(1) There are several binary classifiers predicting different class labels for one 
given data point. Such regions are (A,B ?), (A,C ?), (B,C ?). (2) There are regions 
where all classifiers predict the “rest” class, (A,B,C ?). For the one-against-one 
method, only one kind of inconsistent decision is possible: several binary 
classifiers predict different class labels for one given data point. The problem 
of several classifiers predicting different class labels can be solved by assigning the 
class label at random (Hsu & Lin, 2002) or by assigning the data point to the class 
with the largest prior probability (Tax & Duin, 2002). The second kind of inconsistent 
decision of the one-against-rest method can be acceptable for some problems, 
where “no decision” might be better than a “wrong decision”. Otherwise, one can 
use the same strategy as for the other kind of inconsistent decisions. 
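
The two kinds of inconsistency can be made concrete with a small Python sketch. It does not train any classifier; it only inspects the outputs of hypothetical binary classifiers for a single data point. The function names and the example outputs are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def one_against_rest(votes):
    """votes maps each class to True if its binary classifier claims the point."""
    claimed = sorted(c for c, fired in votes.items() if fired)
    if len(claimed) == 1:
        return claimed[0], "consistent"
    if not claimed:
        return None, "all classifiers predict the rest class"
    return claimed, "several classes claim the point"

def one_against_one(classes, winner):
    """winner(a, b) is the class preferred by the pairwise classifier for (a, b).
    There are k(k-1)/2 such classifiers for k classes."""
    tally = Counter(winner(a, b) for a, b in combinations(classes, 2))
    top = max(tally.values())
    leaders = sorted(c for c, n in tally.items() if n == top)
    if len(leaders) == 1:
        return leaders[0], "consistent"
    return leaders, "tie in the majority vote"

# A point claimed by the classifiers of both A and B (region "A,B ?").
conflict = one_against_rest({"A": True, "B": True, "C": False})
# A point that every classifier assigns to its rest class (region "A,B,C ?").
nobody = one_against_rest({"A": False, "B": False, "C": False})
# Cyclic pairwise decisions: A beats B, B beats C, C beats A.
cycle = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "B"}
tie = one_against_one(["A", "B", "C"], lambda a, b: cycle[(a, b)])
```

In all three cases the caller still has to decide what to do, e.g., by a random choice, by class priors, or by a cost hierarchy.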



4 The Multi-class Ensemble 

There are two possibilities to extend our binary ensemble framework from Sect. 2. 
The first variant uses multi-class models as submodels. The second variant still 
uses binary submodels and performs the multi-class classification by the overall 
ensemble. Note: For safety-related problems it is crucial to take into account that 
the commonly used strategies of extending binary classifiers to multi-class classifiers, 
which are illustrated in Fig. 3, may lead to regions with inconsistent decisions, 
cf. Sect. 3. 



Hierarchy of Misclassification Costs 

In order to avoid inconsistent and undesired solutions, all extensions of the Verifiable 
Ensemble require a hierarchy of misclassification costs, that is, it is assumed that 
there exists an ordering of the class labels which allows statements like: “class $c_1$ 
samples should never be misclassified, class $c_2$ samples might be misclassified only 
as class $c_1$ samples, class $c_3$ samples might be misclassified as class $c_1$ or $c_2$ samples, ...” 

$$penalty(c_1) > penalty(c_2) > penalty(c_3) > \dots \qquad (4)$$

Such a hierarchy of misclassification costs leads to a confusion matrix as depicted 
in Table 1. Among all predicted class labels, a new data point $\vec{v}$ is assigned the 
one with the largest misclassification costs. For safety-related problems, 
such a hierarchy can be assumed because different states of a system might 
result in different perilous consequences. This issue is closely related to ordinal 
classification problems. An SVM-based approach for ordinal classification can be 
found in Cardoso, da Costa, and Cardoso (2005). 
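
The assignment rule can be written down directly. The following Python sketch assumes that the hierarchy is given as a list of labels ordered from the highest to the lowest penalty; the function name is illustrative, and the fallback for an empty set of predictions is an assumption of this sketch.

```python
def resolve_by_hierarchy(predicted, hierarchy):
    """Among the predicted labels, return the one with the largest
    misclassification cost. `hierarchy` lists all labels from the highest
    to the lowest penalty, as in Eq. (4)."""
    for label in hierarchy:
        if label in predicted:
            return label
    # Assumption of this sketch: without any prediction, fall back to the
    # class with the highest penalty.
    return hierarchy[0]

hierarchy = ["c1", "c2", "c3"]          # penalty(c1) > penalty(c2) > penalty(c3)
choice = resolve_by_hierarchy({"c2", "c3"}, hierarchy)
```

Here the point is claimed for both `c2` and `c3`, and the rule selects `c2` because misclassifying a `c2` sample is more costly.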



Table 1 Confusion Matrix for multi-class submodels in a Verifiable Ensemble. The following 

Ensemble of Multi-class Submodels 

Combining several multi-class submodels becomes difficult because one can only 
rely on the prediction of the class $c_\ell$ that has the minimal misclassification cost - 
all other class label predictions might be false positives, cf. Table 1. Thus, it is 
necessary to include all samples that are not predicted as class $c_\ell$ in the training for the 
next submodel. Obviously, the problem becomes a binary classification task 
(separating class $c_\ell$ from $\mathbb{K} \setminus c_\ell$) and using multi-class submodels becomes obsolete. 
Instead, we recommend using the Hierarchical Separate-and-Conquer Ensemble. 



Hierarchical Separate-and-Conquer Ensemble 

This approach is related to the commonly used one-against-rest approach and is 
illustrated in Fig. 4. It directly follows the hierarchy of the misclassification costs. 
Instead of building all one-against-rest combinations of models, the class with the 
minimal misclassification costs is separated from all samples of the other classes 
with (several) binary submodels. The learning procedure is the same as for the 
Verifiable Ensemble, which is described in Sect. 2. If the problem is solved for the class 
with minimal misclassification costs or no further improvements are possible, 
all samples of this class are removed from the training data set and the procedure is 
repeated for the class which now has the smallest misclassification costs. The learning 
procedure is repeated until the data set of the next iteration has only a single 
class label. The resulting binary classifiers are evaluated according to the hierarchy 
of misclassification costs, that is, in the first step all submodels of the class with 
minimal misclassification costs are evaluated. If the novel sample cannot be assigned to 
this class, the procedure is repeated for the next class within the hierarchy of 
misclassification costs. If no submodel claims the novel sample, it is assigned 
to the class with the maximal misclassification costs. 
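
The training and evaluation order described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the binary submodels are hypothetical disc-shaped stand-ins for the SVMs used in the paper (a submodel fires only outside a disc that encloses all samples of the remaining classes), and all function names and the toy data are illustrative.

```python
from itertools import combinations
from math import dist

def fit_disc(rest, beta):
    """Stand-in submodel on projection beta: fires only outside a disc that
    encloses all 'rest' samples, so it never fires on a rest sample."""
    proj = [tuple(v[i] for i in beta) for v in rest]
    center = tuple(sum(c) / len(proj) for c in zip(*proj))
    radius = max(dist(p, center) for p in proj)
    return lambda v: dist(tuple(v[i] for i in beta), center) > radius

def cover(target, rest, dim=2):
    """Sequential covering: add binary submodels until no further target
    sample can be separated from the rest."""
    models = []
    while target and rest:
        n = len(rest[0])
        best = max((fit_disc(rest, b) for b in combinations(range(n), dim)),
                   key=lambda g: sum(g(v) for v in target))
        if not any(best(v) for v in target):
            break                                   # no further improvement
        models.append(best)
        target = [v for v in target if not best(v)]
    return models

def train_hierarchy(samples, order, dim=2):
    """`order` lists the labels from the lowest to the highest penalty."""
    stages, remaining = [], list(samples)
    for c in order[:-1]:
        target = [v for v, y in remaining if y == c]
        rest = [v for v, y in remaining if y != c]
        stages.append((c, cover(target, rest, dim)))
        remaining = [(v, y) for v, y in remaining if y != c]   # drop class c
    return stages

def classify(stages, order, v):
    for c, models in stages:            # follow the hierarchy of costs
        if any(g(v) for g in models):
            return c
    return order[-1]                    # class with the maximal costs

# Toy data with penalty(A) > penalty(B) > penalty(C).
samples = [((0, 0, 0), "A"), ((0.1, 0, 0), "A"), ((0, 0.1, 0), "A"),
           ((0, 0, 0.1), "A"), ((-0.1, -0.1, -0.1), "A"),
           ((1, 0, 0), "B"), ((1.1, 0, 0), "B"),
           ((0, 0, 2), "C"), ((0, 0, 2.2), "C")]
order = ["C", "B", "A"]
stages = train_hierarchy(samples, order)
```

Because every submodel is constrained not to fire on samples of the remaining, more costly classes, a sample can only be pulled towards a cheaper class if it lies outside the region covered by those classes in at least one projection.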




Fig. 4 Hierarchical Separate-and-Conquer Ensemble trained on the data from Fig. 3: 
(a) discriminant functions, (b) confusion matrix. The following hierarchy of 
misclassification costs is assumed: penalty(A) > penalty(B) > penalty(C) 

Fig. 5 One-Vs.-Rest Ensemble trained on the data from Fig. 3, with the confusion matrix 
in panel (b). Each model for class $c_\ell$ is trained with the objective to avoid the 
misclassification of all samples belonging to $\bar{c}_\ell = \mathbb{K} \setminus c_\ell$. Missed 
samples are denoted by a question mark in the confusion matrix 



One-Vs.-Rest Ensemble 

This approach follows the one-against-rest multi-class classification approach. It is 
illustrated in Fig. 5. For every class $c_\ell \in \mathbb{K}$ vs. $\bar{c}_\ell = \mathbb{K} \setminus c_\ell$ a complete binary 
Verifiable Ensemble $f_{c_\ell}(\vec{v})$ is trained, cf. Sect. 2. The class $\bar{c}_\ell$ is chosen as the 
default class $c_{pref}$ to avoid the misclassification of any sample belonging to $\mathbb{K} \setminus c_\ell$. 
For the sake of simplicity $\bar{c}_\ell$ is encoded as $-1$. The resulting binary models can be 
combined by determining the maximum: 

$$f(\vec{v}) = \arg\max_{c_\ell \in \mathbb{K}} f_{c_\ell}(\vec{v}) . \qquad (5)$$

This approach is the easiest way to extend the binary submodeling approach to 
multi-class modeling, but it performs poorly on overlapping data sets: it 
is possible that certain data points will be assigned to the class $\bar{c}_\ell$ by every submodel 
and that some classes cannot be separated from the other classes because the classes 
overlap in all projections. This approach still yields ambiguous decisions 
within the input space, as shown in Fig. 5. Such ambiguities can be resolved by the 
hierarchy of misclassification costs. 



Related Work 

In Szepannek and Weihs (2006), a pairwise variable subset selection method is 
proposed in order to extend binary classifiers to solve multi-class problems with a 
one-against-one classifier combination. In contrast to our approach, this method 
does not limit the number of dimensions included for learning the submodels - all 
input dimensions that provide statistically sufficient information are included for 
learning a single submodel to separate a pair of classes. Our approach may use 
several submodels with limited dimensionality to solve the same subproblem, while 
each submodel remains visually interpretable. Although the pairwise variable subset 
selection method can give a good insight into the importance of the different 
dimensions and might achieve a better predictive performance, it does 
not allow the validation of each submodel. Thus, it will not satisfy the 
requirements of safety-related application problems. 

Fig. 6 Hierarchical Separate-and-Conquer Ensemble and the multi-class Cubes data set: 
(a) 1st submodel, (b) 2nd submodel. Class 1 samples are shown as circles, Class 2 samples 
are shown as crosses, Class 3 samples are shown as downward-pointing triangles, and 
Class 4 samples are shown as upward-pointing triangles 



An Illustrative Example (Continued) 

We extend the example discussed in Sect. 2 to a four-class problem: the 
Class 2 samples are drawn from $\mathcal{N}((0.0, 0.0, 0.0)^T, 0.2 \times I)$, the Class 3 samples 
are drawn from $\mathcal{N}((0.5, 0.5, 0.5)^T, 0.2 \times I)$, the Class 4 samples are drawn 
from $\mathcal{N}((1.0, 1.0, 1.0)^T, 0.2 \times I)$, and the samples of Class 1 are drawn from 
$\mathcal{N}(e_i + l \times 0.5, 0.2 \times I)$, $l \in \{0, 1, 2\}$. For this multi-class problem the following 
hierarchy of misclassification costs is assumed: penalty(Class 4) > penalty(Class 3) > 
penalty(Class 2) > penalty(Class 1). 



Hierarchical Separate-and-Conquer Ensemble 

This approach solves the problem with four submodels, all shown in Fig. 6. The 
first submodel separates most of the Class 1 samples from the samples of the other 
classes. The remaining Class 1 samples are removed by the second submodel - 
further improvements in predicting Class 1 are not possible. Thus, according to the 
hierarchy of misclassification costs, the third submodel separates the samples drawn 
from Class 2 from the samples of Class 3 and Class 4. Finally, the fourth submodel 
separates the Class 3 samples from the Class 4 samples. 

One-Vs.-Rest Ensemble 

This example shows the limitations of the One-Vs.-Rest Ensemble approach: it is 
not possible to build one-vs.-rest models for Class 2, Class 3, and Class 4 without 
misclassifying samples from Class 1. The only models returned by this approach are 
the same as shown in Figs. 6a and 6b, that is, only Class 1 samples can be predicted 
correctly; all other samples are predicted as “don’t know”. 



5 Conclusions 

To be able to apply machine learning approaches in the field of safety-related prob- 
lems it is crucial to satisfy the domain experts’ demands that the learned solution 
solves the right problem and complies with the functional specifications. Thus, it is 
important to provide interpretable and verifiable models. 

Our ensemble framework for classification problems greatly facilitates the use of 
machine learning methods in the field of safety-related applications. It requires that 
the given input dimensions allow an (at least partial) separation of the classes within 
the low-dimensional subspaces. The learned submodels are visually interpreted and 
evaluated by the domain experts in order to avoid an unintended extrapolation and 
interpolation behavior. This is particularly made easier by the fact that the submo- 
dels are trained on the original input dimensions, allowing the experts to directly 
evaluate the trained models within their domain knowledge. Thus, the correctness 
of the learned overall solution can be guaranteed although the submodels are trained 
on small subspaces of the input space and the training data might be sparse in 
the high-dimensional space. The ensemble of the submodels compensates for the 
limited predictive performance of each single submodel. The proposed multi-class 
extensions of our binary classification approach maintain the same desirable prop- 
erties by following the hierarchy of misclassification costs and can avoid possible 
inconsistencies that might be induced by commonly used multi-class extensions. 
The One-Vs.-Rest Ensemble is appropriate for problems where unassigned samples 
are acceptable but a very low false-negative rate is required - the Hierarchical 
Separate-and-Conquer Ensemble also achieves a low false-negative rate and can 
capture more samples by its sequential covering algorithm. 



Dynamic Disturbances in BTA Deep-Hole 
Drilling: Modelling Chatter and Spiralling 
as Regenerative Effects 



Nils Raabe, Dirk Enk, Dirk Biermann, and Claus Weihs 



Abstract The BTA deep-hole drilling process is very often one of 
the final steps in the production of expensive workpieces. For example, axial bores 
in turbines or compressor shafts are produced with this process. A serious problem 
in deep-hole drilling is the formation of dynamic disturbances, the most common 
types of which are chatter and spiralling. Chatter shows 
in self-excited rotational vibrations which lead to increased tool wear, while spiralling 
is governed by bending vibrations and causes holes with several lobes. Since 
such lobes are a severe impairment of the bore hole, the formation of spiralling has 
to be prevented. One common explanation for the occurrence of spiralling is the 
intersection of time-varying bending eigenfrequencies with multiples of the tool’s 
rotational frequency. Little is known about which specific eigenfrequencies are crucial. 
Furthermore, an underlying assumption of this explanation is that the resulting 
holes appear in cross-sectional view as a curve of constant width. This 
assumption implies that spiralling results from a parallel displacement of the drill 
head. We disprove this assumption and show how stability charts for the 
classification between stable and unstable processes can be computed by means of 
simulations. These simulations result from statistical-physical models which model 
the disturbances chatter and spiralling as regenerative effects. 

Keywords Deep-hole drilling · Regenerative effect · Statistical-physical modelling. 



1 Introduction 



Deep hole drilling methods have been developed for the production of holes with 
a length that exceeds three times their diameter. If the diameter exceeds 20 mm, 
usually the BTA (Boring and Trepanning Association) deep hole machining principle 
is employed (see VDI, 1974). The working principle is illustrated in Fig. 1. The 
workpiece is rotated around the boring bar and shifted from the right to the left 
together with the oil supply device at a fixed distance during the process. Note that 
this movement of the oil supply is the major cause of the time-varying dynamics of 
the system. 

Fig. 1 BTA deep hole drilling, working principle (Webber, 2007) 

Fig. 2 Spiralling (left) and chatter (right) marks 

Due to the necessarily slender shape of the boring bar the tool has a relatively 
low torsional and bending stiffness. Hence the deep-hole drilling process is typically 
subject to dynamic instabilities. These instabilities most commonly show in 
the dynamic disturbance types chatter and spiralling. While chatter leads to tight 
marks in the ground of the hole and to an increased wear of the tool, spiralling 
causes vast, coil-shaped marks in the hole wall (Fig. 2). Since the BTA deep-hole 
drilling process is very often one of the last steps in the production of very expensive 
workpieces like airplane turbines, which furthermore require a very good surface 
finish, it is of prime interest to avoid these disturbances. 



2 Chatter and Spiralling as Regenerative Effects 

Chatter appears in the shape of self-excited torsional vibrations. This self-excitation is 
a regenerative effect that can be explained in the following way. When the process 
starts, the workpiece is rotated around and shifted on the fixed boring bar as 
illustrated in Fig. 3. 

Ideally the boring bar is at rest, but in fact it starts oscillating in the torsional 
direction because the system and especially the torsional eigenfrequencies of the 
tool are excited by the machining process. As a result the cutting edge cuts deeper 
and less deep alternately into the workpiece, producing a wavy surface as shown 
in Fig. 4. Valleys and peaks of this wave pass any given point on the bar with the 
same frequency the bar is vibrating with, which commonly is the least damped 
torsional eigenfrequency. After one revolution of the boring bar the wave reaches 
the position of the cutting edge again at the other side of the chip currently being cut. 
Hence both sides of the chip vary with the same frequency as the bar oscillates. As the chip 
thickness is proportional to the cutting force, it depends on the phase shift whether 
this frequency is damped or excited. The nearer the phase shift is to a multiple of 
$\pi/2$, the higher is the amplitude of the cutting force oscillation and hence the higher 
is the self-excitation (Wolfram, Gepperth, Sandamirskaya, Webber, Raabe, et al., 
2006). 

Fig. 3 Starting cutting process (left). Chip thickness increases linearly during the first revolution 
and then stays constant (right) 

Fig. 4 Cutting process with oscillating cutting edge (left). Chip thickness varying with the same 
frequency (right) 

In contrast to chatter, spiralling has usually not been treated as a regenerative effect.
In trepanning processes spiralling typically appears in the form of a periodic deviation
of the bar from the ideal line, leading to multi-lobed holes. If this explanation applies,
such holes have to be curves of constant width, which implies an odd
number of lobes. However, in former experiments we observed processes in which
spiralling led to holes with an even number of lobes, e.g., four (Fig. 5). As spiralling is
known to become likely whenever bending eigenfrequencies of the boring tool cross multiples
of the rotational frequency (Gessese, Latinovic, & Osman, 1994), in former
work we concentrated on the estimation of bending eigenfrequency courses. By
investigating not only the eigenfrequencies but also the eigenmodes we found
that the stiffness influence of the workpiece supporting the bar was so high that
no mode showed a parallel deviation at the end of the bar. Instead, some modes
showed a clear tilt at this position, and apparently only these modes were responsible
for spiralling. On the one hand, we were then able to evaluate each eigenfrequency
with respect to its ability to cause spiralling. On the other hand, this result gave us
a basis for an explanation of the formation of spiralling that differs slightly from the common
one. The bar starts oscillating in lateral direction with the eigenfrequencies and
therefore leaves the ideal boring line. However, the bar does not deviate at its end but
tilts. Therefore the diameter of the hole varies. If now one eigenfrequency with a very
high tilt of the corresponding eigenmode coincides with m times the rotational
frequency, this eigenmode reaches maximum amplitude m times per revolution, i.e.,
m lobes are cut into one circuit of the hole. In particular, maximum amplitude is
reached after exactly one revolution, so the tool cuts into the same lobes
again during the next revolution. Therefore the chip thickness decreases, which, as
in the case of chatter explained above, is proportional to the cutting force. Hence
the involved eigenfrequency becomes less damped and the regenerative effect begins.
The regeneration implies that spiralling propagates even when the coincidence of the
eigenfrequency with the multiple of the rotational frequency ends (compare Fig. 5).

Fig. 5 Left: Spectrogram of a process with spiralling after an intersection of the second bending
eigenfrequency with the quadruple of the rotational frequency. Middle: Roundness error before
(top) and after (bottom) the intersection. Right: First four eigenmodes; the second eigenmode, with the
highest tilt at the drill head, is responsible for spiralling
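
The coincidence condition described above can be illustrated with a small sketch. The function name `lobe_count` and the relative tolerance are assumptions made here for illustration only; the sketch merely checks whether a bending eigenfrequency lies close to an integer multiple m of the rotational frequency, in which case the explanation above predicts m lobes.

```python
def lobe_count(eigenfrequency_hz, rotational_frequency_hz, rel_tol=0.02):
    """Return m if the eigenfrequency lies within rel_tol (an assumed value)
    of m times the rotational frequency, otherwise None."""
    if eigenfrequency_hz <= 0 or rotational_frequency_hz <= 0:
        raise ValueError("frequencies must be positive")
    ratio = eigenfrequency_hz / rotational_frequency_hz
    m = round(ratio)  # nearest integer multiple
    if m >= 1 and abs(ratio - m) <= rel_tol * m:
        return m
    return None
```

For instance, an eigenfrequency of 40 Hz at a rotational frequency of 10 Hz corresponds to the four-lobed case shown in Fig. 5.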



3 Modelling Chatter 

3.1 Torsional Vibration Model 

The explanation for the formation of chatter described in the last section has been
used to set up a realistic chatter simulation model. This model consists of a torsional
vibration model and a process model of the drilled surface and the drilling torque
(Webber, 2007). The torsional vibration model is a discretized
analogous model of the boring bar with elements movable in torsional direction.
Figure 6 shows an exemplary model of this type with five degrees of freedom. The
geometrical and physical properties of the elements (length, mass, inner and outer diameter, and torsional
stiffness) can be computed directly from the properties of the
boring bar.

Fig. 6 Torsional vibration model; $k_i$ denotes the torsional stiffness, $\varphi_i$ the torsion angle of the i-th
element, $\Theta$ the mass moment of inertia and $c_i$ the torsional damping

From this model the actual angular torque and the torsion angle of each ele- 
ment can be computed in dependence on the drilling torque. The dependency of the 
drilling torque on the process parameters rotational frequency, cutting speed, feed 
and tool diameter is postulated by the process model. 
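
A minimal sketch of such a discretized torsional model is given below. It is not the authors' model: identical elements, dampers placed parallel to the springs, a constant drilling torque at the tool end, the semi-implicit Euler integrator and all numerical values are assumptions chosen purely for illustration.

```python
def simulate_torsional_chain(n_elements=5, inertia=1.0, stiffness=100.0,
                             damping=5.0, drilling_torque=1.0,
                             dt=1e-3, n_steps=40000):
    """Integrate a clamped chain of torsional elements (semi-implicit Euler).

    Element 0 is coupled to the clamped end; the last element carries the
    drilling torque. Returns the torsion angles after n_steps increments.
    """
    phi = [0.0] * n_elements      # torsion angles
    omega = [0.0] * n_elements    # angular velocities
    for _ in range(n_steps):
        torque = [0.0] * n_elements
        for i in range(n_elements):
            # Spring and damper towards the previous element (or the clamp).
            prev_phi = phi[i - 1] if i > 0 else 0.0
            prev_omega = omega[i - 1] if i > 0 else 0.0
            coupling = (stiffness * (phi[i] - prev_phi)
                        + damping * (omega[i] - prev_omega))
            torque[i] -= coupling
            if i > 0:
                torque[i - 1] += coupling   # reaction on the neighbour
        torque[-1] += drilling_torque       # load acts at the tool end
        for i in range(n_elements):
            omega[i] += dt * torque[i] / inertia
            phi[i] += dt * omega[i]
    return phi
```

With a constant torque the chain settles at its static deflection, where every spring carries the full torque; a process model would replace the constant by a torque that depends on the current chip thickness.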



3.2 Chatter Simulation 

The simulation of the process is obtained by subsequently updating the angle of the 
tool relative to the workpiece, the cutting thickness and the drilling torque for each 
time increment. 

As comparisons showed, drilling torque time series computed by the simulation
model show the same behavior as those from real processes. With the model it is
possible to compute so-called stability charts with respect to chatter. Stability charts
are two-dimensional diagrams representing the extent of chatter as a function of
the machine parameters rotational frequency and feed. Figure 7 shows an exemplary
stability chart. There, light areas indicate stable settings while dark areas represent
settings which lead to a high extent of chatter.

With the stability chart it is possible to classify parameter settings into stable and
unstable processes. From an economic point of view both high feeds and high cutting
speeds are desirable. Typically, high feeds imply a higher extent of chatter. With
the stability charts it is possible to find stable regions of both high feed and high speed
between the unstable areas, which appear periodically with respect to speed. However,
sensitivity analyses will be performed to test how severely stochastic deviations in the
process parameters and relations affect the resulting stable areas.

Fig. 7 Stability chart for chatter



4 Modelling Spiralling 
4.1 Bending Vibration Model 

As bending eigenfrequencies affect the occurrence of spiralling, a physical model
of bending vibrations has been proposed. It is also a discretized analogous
model; in contrast to the torsional rotors, the bending vibration model consists of
elements movable in lateral direction. Figure 8 shows a simplified version of the
model with five degrees of freedom.

By solving the equations of motion of the system it is possible to compute the
bending eigenmodes and eigenfrequencies as functions of the physical and geometrical
properties of the system. However, some of the parameters, primarily the
stiffness influences of the supporting elements (damper and oil supply device), are
not known and therefore have to be estimated. As the bending eigenfrequencies
turned out to show up clearly in the signal of the structure-borne sound, a statistical
model has been proposed that allows the maximum likelihood estimation of
the unknown parameters based on the spectrogram of this signal (Weinert, Weihs,
Webber, & Raabe, 2007).



4.2 Clustering of Increasing Eigenfrequency Courses 

Because the evaluation of the likelihood function is very time-consuming, it has up to
now been restricted to the lower part of the spectrogram where the relevant
eigenfrequencies are evidently located (Fig. 9 left). However, the higher parts also
contain information about courses of eigenfrequencies (Fig. 9 right).

Fig. 8 Bending vibration model; $k_i$ denotes the bending stiffness, $m$ the mass of each element, $x_i$ the
deflection of the i-th element and $k_s$ the stiffness influence of the support

Fig. 9 Low-frequency (left) and complete (right) spectrogram of the structure-borne sound of a
deep hole drilling process

Because the signal is measured at a fixed position between the damper and the
oil supply device, which moves towards the damper during the process, the most
prominent frequencies are increasing. This holds especially for higher frequency
regions. However, the eigenfrequencies that are critical with respect to spiralling are
the decreasing ones, because these frequencies are active on the other side of the oil
supply device, where the drill head contacts the workpiece. Here we show how
the information from the increasing eigenfrequencies can be used to improve the starting
values for the maximization of the likelihood and thereby to improve the estimation
of the relevant frequencies.

Let $S \in \mathbb{R}^{F \times T}$ be the spectrogram of the structure-borne sound, where $s_{\cdot t}$ is the
power spectrum of the t-th time frame, F the number of Fourier frequencies and T
the number of time frames. Let furthermore $f_1, \dots, f_F$ be the Fourier frequencies
in Hertz and $t_1, \dots, t_T$ be the end points of the time frames in seconds.

To improve consistency, a smoothed version of the spectrogram may be used. 
In our investigations we obtained smoothing in frequency direction by using the 
Daniell estimator of the spectrum. In time direction the spectrogram was exponen- 
tially smoothed (Wei, 2005). 
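
The two smoothing steps can be sketched as follows. The window half-width, the smoothing weight `alpha` and the truncation of the window at the borders of the spectrum are assumptions of this sketch; the text does not state the values actually used.

```python
def daniell_smooth(power_spectrum, half_width=2):
    """Daniell estimator: equally weighted moving average over
    2*half_width+1 neighbouring Fourier frequencies. The window is
    truncated at both ends of the spectrum (an assumed border rule)."""
    n = len(power_spectrum)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        window = power_spectrum[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def smooth_spectrogram(frames, half_width=2, alpha=0.3):
    """frames[t] is the power spectrum of the t-th time frame.

    Each frame is Daniell-smoothed in frequency direction; the frames are
    then exponentially smoothed in time direction with weight alpha.
    """
    result = []
    previous = None
    for frame in frames:
        current = daniell_smooth(frame, half_width)
        if previous is not None:
            current = [alpha * c + (1.0 - alpha) * p
                       for c, p in zip(current, previous)]
        result.append(current)
        previous = current
    return result
```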

Next a matrix $X \in \mathbb{R}^{n \times 2}$ is constructed which row-wise consists of the pairs of
frequency $f_i$ and time point $t_j$ of all $n$ elements $s_{ij}$ lying above a predefined cutoff
value. The eigenfrequency courses are determined by cluster-wise
fitting of quadratic regressions of frequency on time to these data. However, as start
and end frequencies of different courses may in general overlap, it is not possible
to fit these regressions to frequency bands as in Raabe, Theis, and Webber (2004).
Instead, the clusters are determined by a “K-Conditional-Means” algorithm that
proceeds in the following steps:

1. Set initial cluster number $K_0$ and iteration numbers $J_1$ and $J_2$.

2. Initialize regression parameters: $\beta_{0,k} = k \cdot (f_F / K_0)$ and $\beta_{1,k} = 0$, $k = 1, \dots, K_0$.

3. Set actual iteration number j to zero.

4. Increase j by one.

5. Compute matrix $\tilde{X}$ consisting of all rows $x_{i\cdot}$ of $X$ for which $x_{i,2} \le j \cdot (t_T / J_1)$.

6. Assign all rows of $\tilde{X}$ to their nearest conditional mean by defining the cluster
membership vector c with
$c_i = \arg\min_{k=1,\dots,K_0} \{ [x_{i,1} - (\beta_{0,k} + \beta_{1,k} x_{i,2})]^2 \}$ for every row $i$ of $\tilde{X}$.

7. Update regression parameters $\beta_{0,k}$ and $\beta_{1,k}$ of each cluster k by fitting an OLS
regression of $x_{\cdot 1}$ on $x_{\cdot 2}$.

8. If $j < J_1$ continue with 4.

9. Fit quadratic regressions $x_{i,1} = \beta_{0,k} + \beta_{1,k} x_{i,2} + \beta_{2,k} x_{i,2}^2$ to each cluster and
repeat steps 3-7 $J_2$ times with fixed $\tilde{X} = X$ and quadratic instead of linear
regressions.

10. Eliminate redundant clusters, i.e., nearly empty clusters or clusters the regression
curves of which intersect one or more of the other ones.

11. Refit regressions in remaining clusters.

Figure 10 shows a visualization of the different stages of the algorithm. There the 
fitted eigenfrequency courses are plotted by solid lines and the frequencies above 
cutoff by light gray dots. 
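
The core of the procedure (initialization, the growing time window, assignment and refitting, first with linear and then with quadratic regressions) can be sketched in plain Python as follows. This is an illustrative reading of the steps, not the authors' implementation: the function names, the handling of clusters with too few points (their old parameters are kept) and the omission of the elimination of redundant clusters in steps 10 and 11 are assumptions of this sketch.

```python
def fit_polynomial(points, degree):
    """OLS fit of frequency on time via the normal equations.
    points are (frequency, time) pairs; returns [b0, b1, ...] or None."""
    size = degree + 1
    a = [[sum(t ** (r + c) for _, t in points) for c in range(size)]
         for r in range(size)]
    y = [sum(f * t ** r for f, t in points) for r in range(size)]
    for col in range(size):                       # Gaussian elimination
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return None                           # degenerate cluster
        a[col], a[pivot] = a[pivot], a[col]
        y[col], y[pivot] = y[pivot], y[col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
            y[r] -= factor * y[col]
    coeffs = [0.0] * size
    for r in reversed(range(size)):               # back substitution
        rest = sum(a[r][c] * coeffs[c] for c in range(r + 1, size))
        coeffs[r] = (y[r] - rest) / a[r][r]
    return coeffs


def predict(coeffs, t):
    return sum(b * t ** p for p, b in enumerate(coeffs))


def refit(rows, labels, params, degree):
    """Refit each cluster; clusters with too few points keep their old curve."""
    new_params = []
    for k, old in enumerate(params):
        members = [row for row, c in zip(rows, labels) if c == k]
        fitted = fit_polynomial(members, degree) if len(members) > degree else None
        if fitted is None:
            fitted = old + [0.0] * (degree + 1 - len(old))
        new_params.append(fitted)
    return new_params


def k_conditional_means(rows, k0, j1, j2, f_max, t_max):
    """rows: (frequency, time) pairs above the cutoff; requires j1 >= 1."""
    # Step 2: equally spaced horizontal regression lines.
    params = [[k * f_max / k0, 0.0] for k in range(1, k0 + 1)]
    labels = []
    for degree, n_iter in ((1, j1), (2, j2)):
        if degree == 2:                           # step 9: quadratic refit
            params = refit(rows, labels, params, 2)
        for j in range(1, n_iter + 1):
            # Step 5: growing time window in the linear stage only.
            limit = j * t_max / j1 if degree == 1 else t_max
            active = [row for row in rows if row[1] <= limit]
            # Step 6: assign each row to the nearest regression curve.
            labels = [min(range(k0),
                          key=lambda k: (f - predict(params[k], t)) ** 2)
                      for f, t in active]
            # Step 7: update the regression parameters.
            params = refit(active, labels, params, degree)
    return params, labels
```

The growing time window in the linear stage lets the regression lines follow the courses from the start of the process before all points are used.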



5 Outlook 

Once the increasing frequency courses have been determined, it is possible to fit
the parameters of the physical model by minimizing the sum of squared errors
between the courses fitted as described in the previous section and those implied
by the physical model. The resulting parameter values should be good starting values
for the maximization of the likelihood function of the statistical model. On the
one hand, this decreases the chance of getting stuck in local optima; on
the other hand, it significantly speeds up the optimization process. With properly
estimated parameter values it is then also possible to deduce the decreasing eigenfrequency
courses which, as explained above, are the critical ones for the occurrence
of spiralling.

Fig. 10 Visualization of different stages of the algorithm. Top left: Initialization of $K_0$ equally
spaced regression lines with zero slope (step 2). Top right: Linear regressions after $J_1$ iterations of
steps 4-8. Bottom left: Quadratic regressions after $J_2$ iterations of step 9. Bottom right: Final fit
after elimination of redundant clusters (step 11)

The explanation for spiralling as a regenerative effect given in this paper will
be used to set up a simulation model for spiralling, similar to the one for
chatter. First attempts in this direction showed promising results. One application
of the simulation model will be the construction of stability charts for spiralling. In
contrast to those for chatter, these charts will have to be dynamic, as the bending frequencies
are time-variant. First investigations showed that low feed values coincide
with a higher chance of spiralling. Because the opposite holds for chatter, simultaneous
strategies for the avoidance of both disturbances have to be developed. This
task will be solved by means of multicriteria optimization methods like desirability
indices. Further criteria like hole quality and process costs will be included in this
optimization.



Nonnegative Matrix Factorization for Binary
Data to Extract Elementary Failure Maps
from Wafer Test Images



Reinhard Schachtner, Gerhard Poppel, and Elmar Lang 



Abstract We introduce a probabilistic variant of nonnegative matrix factoriza- 
tion (NMF) applied to binary datasets. Hence we consider binary coded images 
as a probabilistic superposition of underlying continuous-valued basic patterns. 
An extension of the well-known NMF procedure to binary-valued datasets is pro- 
vided to solve the related optimization problem with nonnegativity constraints. We 
demonstrate the performance of our method by applying it to the detection and 
characterization of hidden causes for failures during wafer processing. Therefore, 
we decompose binary coded (pass/fail) wafer test data into underlying elementary 
failure patterns and study their influence on the quality of single wafers. 

Keywords Binary data · Failure patterns · Nonnegative matrix factorization



1 Introduction 

Manufacturing a microchip requires up to hundreds of production steps, depending
on the complexity of its components. Lifetime, performance speed and other quality
aspects yield a set of specifications tailored to the intended application field. The
overall functionality of the completed chips is measured in a test series after the
last step of production. A chip is labelled “pass” if it satisfies all investigated features,
and “fail” otherwise. A disordered or wrongly calibrated production machine
can cause a whole series of chips to fail a quality check. The identification and
explanation of such systematic errors is a highly interesting and nontrivial problem.
While several individual root causes can be responsible for a failed
device, only the overall “pass”/“fail” information for the chip is available in any
case. In this paper we introduce a new method to model the systematic part of errors
by a superposition of individual failure causes.



1.1 Notation

Measurement data from N wafers constitute a binary N × M data matrix X, each
row of which contains all M chips of one wafer aligned. Matrix entry $X_{ij}$ contains
the information whether chip j on wafer i has passed all functionality tests (0), or
failed at any of them (1). In the following, we use $X_{i*}$ to denote the i-th row and
$X_{*j}$ for the j-th column of X, meaning one whole wafer i or one chip position j
on all wafers, respectively.



2 Nonnegative Matrix Factorization 

Nonnegative matrix factorization (NMF) is a very popular technique for the analysis
of real-valued multivariate datasets. In the context of Blind Source Separation, NMF is
intended to explain a data generating process as a strictly additive superposition of
nonnegative sources. A nonnegative N × M data matrix X is approximated by an
N × K matrix W and a K × M matrix H such that

$$X \approx WH \qquad (W, H \ge 0). \tag{1}$$

The number of basis components K is usually chosen so that $(N + M)K < NM$. In
that case, the product WH can be regarded as a compressed version of the original
data X (Lee & Seung, 1999). Technically, the task of an NMF can be formulated as
an optimization problem by minimizing a suitable cost function, such as the squared
Euclidean distance

$$f(W, H) = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( X_{ij} - [WH]_{ij} \right)^2 \tag{2}$$

subject to the non-negativity constraints $W, H \ge 0$. Other cost functions, such
as the Kullback-Leibler (Lee & Seung, 1999), Bregman (Dhillon & Sra, 2006) or
Csiszár’s (Cichocki, Zdunek, & Amari, 2006) divergences, have been proposed in
the literature. Additional sparseness or smoothness parameters to enforce solutions with
desired characteristics, as well as a variety of optimization techniques to achieve
the desired matrix decomposition, have been discussed (see, e.g., Berry, Browne,
Langville, Pauca, & Plemmons, 2007; Cichocki, Zdunek, & Amari, 2008 for a
survey).



2.1 Alternating Least Squares Algorithm for NMF

A very popular method to minimize the squared Euclidean distance $f(W, H)$ (2) is
the Alternating Least Squares (ALS) procedure. It can be summarized by the following
steps (see, e.g., Berry et al., 2007):

Initialize W at random

iterate the following steps for i = 1 to maxiter:

    Solve for H in matrix equation $\partial f / \partial H = 0$.    (3)
    Set all negative elements in H to 0.                              (4)
    Solve for W in matrix equation $\partial f / \partial W = 0$.    (5)
    Set all negative elements in W to 0.                              (6)



ALS procedures which properly enforce non-negativity of W and H can be proven
to converge towards a local minimum of the cost function (Berry et al., 2007). Unfortunately,
the rough projection onto the nonnegative orthant after every optimization
step can cause convergence problems. In case of convergence, however, projected
ALS is extremely fast. Computing several runs using different random initializations
thus still outperforms other methods like gradient descent and multiplicative
update rules with respect to the required computational time. Despite its convergence
problems, the projected ALS method is therefore very attractive for NMF applications
(see Cichocki et al., 2008).



3 NMF for Binary Datasets 
3.1 Generative Model 

The measurement outcome “fail” of a microchip can have several possible reasons.
Here we assume that the data are generated by K individual root causes which
act simultaneously without influencing each other, and that the probability for a
chip to be “pass” can be expressed by

$$P(\text{“pass”}) = \prod_{k=1}^{K} P(\text{“pass”} \mid \text{root cause } k). \tag{7}$$

Furthermore, there are two aspects of the data generating process to be considered:

1. Each root cause can have varying impact on different wafers.

2. A root cause can be related to a characteristic pattern on a wafer. Such a pattern
manifests itself in a certain “fail”-likeliness for each chip position.

Incorporating both aspects, we employ parameters $W_{ik} \ge 0$ and $H_{kj} \ge 0$ to
describe the probability that chip j of wafer i is “pass” as

$$P(X_{ij} = 0) = \prod_{k=1}^{K} e^{-W_{ik} H_{kj}} = e^{-[WH]_{ij}}, \tag{8}$$

where we recognize the product of nonnegative matrices in the exponent of the last
expression.

Model summary

N : no. of objects          ∈ ℕ
M : dimension               ∈ ℕ
K : no. of basic patterns   ≪ min(N, M)

X : data matrix          N × M    X_ij = 0: “pass”, X_ij = 1: “fail”        ∈ {0, 1}
W : coefficient matrix   N × K    W_ik: weight of pattern k in object i     ∈ [0, ∞[
H : pattern matrix       K × M    H_kj: value of pattern k on position j    ∈ [0, 1]

P(X_ij = 0 | H, W) = e^{-[WH]_ij}        ∈ ]0, 1]
P(X_ij = 1 | H, W) = 1 - e^{-[WH]_ij}    ∈ [0, 1[

Fig. 1 Summary of the model

The parameters $W_{ik}$ reflect the influence of root cause k on wafer i such that
$W_{rk} < W_{sk}$ means that root cause k is more strongly expressed on wafer $X_{s*}$ than
on $X_{r*}$.

We refer to the row vector $H_{k*} = (H_{k1}, \dots, H_{kM}) \ge 0$ as pattern k, where
$H_{kl} < H_{km}$ implies that it is more likely to observe a 1 on chip position m than
on position l due to root cause k. In our description, the term probability is avoided
for the parameters $W_{ik}$ and $H_{kj}$ due to scaling indeterminacies. Only the terms
$e^{-W_{ik} H_{kj}}$ have a probabilistic interpretation.

Summarizing, the “pass”/“fail” probabilities given the hidden root causes are
(Fig. 1)

$$P(X_{ij} = 0 \mid H, W) = e^{-[WH]_{ij}}, \tag{9}$$

$$P(X_{ij} = 1 \mid H, W) = 1 - e^{-[WH]_{ij}}. \tag{10}$$

Both matrices W and H are nonnegative and are related to the binary matrix 
X as described. The challenge of finding these matrices can thus be viewed as an 
extension of NMF for this kind of binary datasets. 
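
As an illustration of the generative model, the following sketch evaluates the “pass” probability $e^{-[WH]_{ij}}$ and draws binary wafer data from it. The function names and the use of nested Python lists for W and H are choices made here for illustration only.

```python
import math
import random


def pass_probability(W, H, i, j):
    """P(X_ij = 0 | H, W) = exp(-[WH]_ij)."""
    wh_ij = sum(W[i][k] * H[k][j] for k in range(len(H)))
    return math.exp(-wh_ij)


def sample_wafers(W, H, seed=0):
    """Draw a binary N x M matrix; an entry is 1 ("fail") with
    probability 1 - exp(-[WH]_ij) and 0 ("pass") otherwise."""
    rng = random.Random(seed)
    n, m = len(W), len(H[0])
    return [[1 if rng.random() < 1.0 - pass_probability(W, H, i, j) else 0
             for j in range(m)]
            for i in range(n)]
```

A wafer with all weights equal to zero never fails in this model, whereas large products $W_{ik} H_{kj}$ drive the fail probability of the corresponding chip towards one.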



3.2 Bernoulli Likelihood 

The Bernoulli likelihood is a natural choice for modelling binary data. Denoting by $p_{ij}$
the probability that $X_{ij} = 1$, the Bernoulli likelihood of one entry is

$$P(X_{ij} \mid H, W) = p_{ij}^{X_{ij}} (1 - p_{ij})^{1 - X_{ij}}. \tag{11}$$

Together with (10) this leads to an overall (log-)likelihood

$$LL = \sum_{i=1}^{N} \sum_{j=1}^{M} \left\{ X_{ij} \ln\left(1 - e^{-[WH]_{ij}}\right) - [WH]_{ij} + X_{ij} [WH]_{ij} \right\}. \tag{12}$$

In Kaban, Bingham, and Hirsimaki (2004), a symmetric linear model is used to 
approximate the Bernoulli parameter of a similar problem. The authors use an EM- 
type approach to maximize a lower bound for the log-likelihood. Here, we propose 
a completely different strategy for the optimization. 

We combine an Alternating Gradient Ascent Algorithm in the variables W and 
H together with a preceding search for appropriate initial values in order to reduce 
the risk of getting stuck in “poor” local maxima. Note that this does not imply that 
the algorithm will necessarily find the global maximum. 
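
For reference, the Bernoulli log-likelihood of this model can be evaluated as in the sketch below, which assumes the fail probability $1 - e^{-[WH]_{ij}}$ of Sect. 3.1. The function name is hypothetical, and returning minus infinity when a “fail” entry has probability zero is a convention chosen here.

```python
import math


def log_likelihood(X, W, H):
    """Sum over all entries of
    X_ij * ln(1 - exp(-a_ij)) - (1 - X_ij) * a_ij, with a_ij = [WH]_ij."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            a = sum(W[i][k] * H[k][j] for k in range(len(H)))
            if x == 1:
                if a <= 0.0:
                    return float("-inf")   # a "fail" with probability zero
                # -expm1(-a) equals 1 - exp(-a) and is accurate for small a.
                total += math.log(-math.expm1(-a))
            else:
                total -= a
    return total
```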



3.3 Optimizing the Log-Likelihood 

3.3.1 Alternating Gradient Ascent Algorithm 

After some suitable initialization of the parameter matrices W and H, an iterative
gradient ascent scheme for the log-likelihood (12) is given by

$$W_{ik} \leftarrow W_{ik} + \eta_W \frac{\partial LL}{\partial W_{ik}}, \tag{13}$$

$$H_{kj} \leftarrow H_{kj} + \eta_H \frac{\partial LL}{\partial H_{kj}}. \tag{14}$$

While one of the matrices is updated, the other one is kept fixed.

Due to the non-negativity constraints on all $W_{ik}$, $H_{kj}$, the stepsize parameters
$\eta_W$ and $\eta_H$ have to be controlled carefully. Especially for small stepsizes, however,
convergence can be extremely slow. Even in the unconstrained case, gradient ascent
algorithms can only be guaranteed to find a local maximum for sufficiently small
$\eta_W$, $\eta_H$. In particular, the logarithm in (12) can cause serious global convergence
problems by inducing local maxima in the log-likelihood function. Single entries
$X_{ij} = 1$ with a small probability $1 - e^{-[WH]_{ij}}$ may pin the optimization algorithm.
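
One alternating update can be sketched as follows. The gradient used is $\partial LL / \partial W_{ik} = \sum_j \left( X_{ij} / (1 - e^{-[WH]_{ij}}) - 1 \right) H_{kj}$, which follows from differentiating the Bernoulli log-likelihood of the model. Clipping the entries at a small positive `floor` is an assumed way of respecting the non-negativity constraints, not necessarily the stepsize control used by the authors, and the function names are illustrative.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def gradient_step_W(X, W, H, eta_w, floor=1e-9):
    """One update W_ik <- W_ik + eta_w * dLL/dW_ik with H held fixed."""
    k_dim, m = len(H), len(H[0])
    new_W = []
    for i in range(len(W)):
        d = []  # derivative of LL with respect to a_ij = [WH]_ij
        for j in range(m):
            a_ij = sum(W[i][k] * H[k][j] for k in range(k_dim))
            a_ij = max(a_ij, floor)          # avoid division by zero
            d.append(X[i][j] / (-math.expm1(-a_ij)) - 1.0)
        new_W.append([
            max(floor, W[i][k] + eta_w * sum(d[j] * H[k][j] for j in range(m)))
            for k in range(k_dim)
        ])
    return new_W


def gradient_step_H(X, W, H, eta_h, floor=1e-9):
    """Same update for H, obtained from the transposed problem
    X^T ~ H^T W^T with W held fixed."""
    return transpose(
        gradient_step_W(transpose(X), transpose(H), transpose(W), eta_h, floor))
```

Alternating calls of the two functions, each with the other matrix held fixed, give the overall scheme.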

In the following, we derive a strategy for finding a “good” starting point for the
Alternating Gradient Ascent.



3.3.2 Alternating Least Squares on a Simplified Problem

In order to obtain suitable initial matrices W and H we first apply a standard
NMF to a simplified version of the true optimization task. To this end, we introduce
an auxiliary variable $\alpha \in\, ]0, 1[$ and set

$$P(X_{ij} = 1) = \begin{cases} 0, & \text{if } X_{ij} = 0 \\ \alpha, & \text{if } X_{ij} = 1 \end{cases} \qquad \text{for all } i, j. \tag{15}$$

$\alpha$ can be regarded as an averaged probability $P(X_{ij} = 1)$ given that the observed
realization was $X_{ij} = 1$. For all $(i, j)$ this can be summarized by

$$\alpha X_{ij} = 1 - e^{-[WH]_{ij}} \quad \Leftrightarrow \quad -\ln(1 - \alpha X_{ij}) = [WH]_{ij}. \tag{16}$$

Note that the left hand side of the last equation is always nonnegative since $\alpha \in\, ]0, 1[$.
Substituting $X'_{ij} := -\ln(1 - \alpha X_{ij})$ we recover a standard NMF problem $X' \approx WH$.
We chose the squared Euclidean distance as a cost function

$$E(\alpha, W, H) = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \ln(1 - \alpha X_{ij}) + [WH]_{ij} \right)^2 \tag{17}$$

and apply the Alternating Least Squares Algorithm as described in Sect. 2.1 in order
to minimize (17) with respect to $W \ge 0$ and $H \ge 0$. The ALS updates are given by

$$H_{rs} \leftarrow \max\left\{ \varepsilon,\ -\sum_{i=1}^{N} \left[ (W^T W)^{-1} W^T \right]_{ri} \ln(1 - \alpha X_{is}) \right\}, \tag{18}$$

$$W_{lm} \leftarrow \max\left\{ \varepsilon,\ -\sum_{j=1}^{M} \ln(1 - \alpha X_{lj}) \left[ H^T (H H^T)^{-1} \right]_{jm} \right\}. \tag{19}$$

To avoid getting stuck in local minima of the cost function (17) the procedure is 
repeated using different random initializations of H and W and only the solution 
with the smallest Euclidean distance is preserved. 
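
The substitution that turns the binary problem into a standard NMF problem is simple to state in code. The sketch below computes the transformed matrix with entries $-\ln(1 - \alpha X_{ij})$ and the squared Euclidean cost of a candidate factorization; the function names are hypothetical and the ALS updates themselves are not shown.

```python
import math


def transform_binary(X, alpha):
    """Entries -ln(1 - alpha * X_ij): zero for "pass" (X_ij = 0) and
    -ln(1 - alpha) for "fail" (X_ij = 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    return [[-math.log(1.0 - alpha * x) for x in row] for row in X]


def simplified_cost(X, W, H, alpha):
    """Squared Euclidean distance between the transformed data and WH."""
    target = transform_binary(X, alpha)
    cost = 0.0
    for i, row in enumerate(target):
        for j, t in enumerate(row):
            wh_ij = sum(W[i][k] * H[k][j] for k in range(len(H)))
            cost += (wh_ij - t) ** 2
    return cost
```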



3.3.3 Determining the Parameter a 

Note that the global minimum of (17) as a function of $\alpha$, W and H is given by $E = 0$
when $\alpha \to 0$, $W, H = 0$, independently of the data X. Thus, we determine the
optimal $\alpha$ by the log-likelihood of the estimated $W(\alpha)$, $H(\alpha)$. If the parameter $\alpha$
is chosen too small, the probabilities $P(X_{ij} = 1)$ are consistently estimated too
small and the related log-likelihood will be small. On the other hand, a large $\alpha \to 1$
leads to very large values $[WH]_{ij}$ for any $X_{ij} = 1$. Due to the matrix product this
implies an increase of the whole column $H_{*j}$ and/or row $W_{i*}$ at the expense of the
reconstruction accuracy for zeros in the same row and column ($X_{is} = 0$, $X_{rj} = 0$).

Fig. 2 Log-likelihood of the approximations computed by the ALS method as a function of $\alpha$ for
10 random initializations. The best value is obtained for $\alpha = 0.87$ in this example. The horizontal
line denotes the true log-likelihood

From simulations on toy datasets (see Sect. 4.1 for details), we observed that the
best obtained log-likelihood $LL(X, W(\alpha), H(\alpha))$ among several randomly initialized
runs resembles a concave function of $\alpha$ (see Fig. 2). Thus, a Golden Section
Search procedure can be applied to obtain the optimal $\alpha$ in a reasonable number of
trials and computational time.
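
A generic golden section search for the maximum of a unimodal function is sketched below. In the setting described here, `objective` would map a value of the auxiliary parameter to the best log-likelihood found among several randomly initialized ALS runs; that objective is not implemented in this sketch, and the tolerance is an assumed value.

```python
import math


def golden_section_max(objective, lower, upper, tol=1e-3):
    """Locate the maximizer of a unimodal function on [lower, upper]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # about 0.618
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc > fd:            # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:                  # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return (a + b) / 2.0
```

Each iteration needs only one new evaluation of the objective, which matters here because every evaluation involves several ALS runs.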



3.3.4 Semi-supervised Mode 

The algorithm presented above can easily be run in a semi-supervised fashion. A
fixed pattern of interest can be stored in a row of H at initialization. While during
the optimization only the randomly initialized $K - 1$ rows of H are updated, the
updates for W remain as usual. For example, an uninformative constant pattern of
ones, $(1, \dots, 1)$, can be utilized to model uniformly distributed noise.



3.4 Other Cost functions 



The Bernoulli likelihood discussed above is not the only possibility to handle
the binary NMF problem. We also experimented with the following class of cost
functions:

$$E_{p,q}(X, W, H) := \sum_{i=1}^{N} \left( \sum_{j=1}^{M} \left| X_{ij} - 1 + e^{-[WH]_{ij}} \right|^{p} \right)^{q}, \qquad p, q > 0, \tag{20}$$

a special case of which is simply the squared Euclidean distance

$$E_{2,1}(X, W, H) := \sum_{i=1}^{N} \sum_{j=1}^{M} \left( X_{ij} - 1 + e^{-[WH]_{ij}} \right)^2. \tag{21}$$

It is our experience that optimizing cost functions of this type also yields useful
decompositions of binary datasets. However, in this paper we focus on the Bernoulli
likelihood approach.



4 Results 

4.1 Toy data Example 

First, we present the performance of the above algorithm on a constructed toy data
example.

K = 4 fixed failure patterns $H_{1*}, \dots, H_{4*}$ were created, each constituting a
square 30 × 30 image (see Fig. 3, left-hand side). Entry $H_{kj}$ is represented by the
greyscale value of pixel j on pattern k (j = 1, ..., 900). In this example we use
three binary patterns (white: 0, black: 1) and one pattern of values graded from zero
in the center to one on the edges. By means of a randomly generated 1,000 × 4
coefficient matrix W, the failure probabilities $p_{ij} = 1 - e^{-[WH]_{ij}}$ were constructed.

Finally, a binary data matrix X of realizations {0, 1} was created by setting the (i, j)-th
entry to 1 with probability $p_{ij}$ (see Fig. 4 for examples). Using the binary matrix
X and the correct number of sources K = 4 as input for the ALS algorithm, the best
log-likelihood value was obtained using $\alpha = 0.87$. As displayed in the
center image of Fig. 3, the simplified ALS algorithm yields quite good approximations
of the original source patterns in this example. Feeding the ALS solutions
as starting points into the Alternating Gradient Ascent, a nearly
perfect reconstruction of the original patterns is achieved after 1,000 iterations (Fig. 3, right). Note that in
the images W and H are rescaled such that the maximum value in each pattern
is given by one.

While the top row of Fig. 4 contains the original randomly generated coefficients
$W_{ik}$, the second row shows the corresponding binary images $X_{i*}$. As an example,
the left image in the second row can be represented by the third and fourth basis
component, while the second image consists of the fourth component only. The last
two rows contain the coefficients estimated by the ALS method and after refinement
by Alternating Gradient Ascent.

Fig. 3 Left: 30 × 30 source patterns valued in [0, 1] (white: 0, black: 1). Center: Reconstructions
gained via ALS. Right: Maximum likelihood solutions

Fig. 4 Toy data examples. Top: Original coefficients $W_{i*}$. Second row: Binary realizations $X_{i*}$.
Third row: Coefficients gained by ALS. Bottom: Coefficients after refinement by Gradient Ascent



4.2 Real World Example 

Finally, we demonstrate the performance of our method on a real world dataset.
The data stem from a special kind of measurement, which
is aimed at identifying latent structures and detecting potential failure causes at an early
stage.

Given a set of N = 3,043 wafers, each containing M = 500 chips, we estimated
K = 4 source patterns $H_{1*}, \dots, H_{4*}$ and the related weight coefficients
$W_{*1}, \dots, W_{*4}$ (see Fig. 5). We identified four clearly distinguishable patterns of
different characteristics: The first source pattern shows a region of higher fail
probability on the upper side of the wafer. The second pattern constitutes a ring
of fails on the edge zone. While the third pattern is a repeated structure consisting
of a group of neighboring fails at constant distance from each other, the fourth pattern
is a bead centered on the wafer. The related W-matrix stores the activity of each
of the four putative sources on each wafer separately. This new representation of the
data contrasts wafers affected by the detected sources with untouched ones and is
intended to support the detection of potential error causes.

Fig. 5 Estimated source patterns $H_{1*}, \dots, H_{4*}$ (left) and contribution coefficients
$W_{*1}, \dots, W_{*4}$ (right) for a real data example comprising 3,043 wafers and 500 chips per wafer



5 Conclusion 

We introduced a probabilistic framework to model systematic failure causes in
microchip production. To this end, a new methodology was presented which utilizes
an extension of nonnegative matrix factorization to this kind of binary datasets. An
optimization technique was presented which maximizes a log-likelihood function
using a fast alternating least squares algorithm followed by gradient ascent refinement.
The performance of the overall procedure was demonstrated on an artificial
and a real world dataset.



Collective Intelligence Generation from User
Contributed Content

Staab, Costis Contopoulos, Ioanna Gkika, Byron Bakaimis, Pavel Smrz,
Yiannis Kompatsiaris, and Yannis Avrithis



Abstract In this paper we provide a foundation for a new generation of services 
and tools. We define new ways of capturing, sharing and reusing information and 
intelligence provided by single users and communities, as well as organizations 
by enabling the extraction, generation, interpretation and management of Collec- 
tive Intelligence from user generated digital multimedia content. Different layers of 
intelligence are generated, which together constitute the notion of Collective Intel- 
ligence. The automatic generation of Collective Intelligence constitutes a departure 
from traditional methods for information sharing, since information from both the 
multimedia content and social aspects will be merged, while at the same time the 
social dynamics will be taken into account. In the context of this work, we present 
two case studies: an Emergency Response and a Consumers Social Group case 
study. 

Keywords Collective intelligence · Mass intelligence · Media intelligence · Organizational
intelligence · Personal intelligence · Social intelligence



1 Introduction 

The public has always played a major role in managing events in small and
large communities, be it emergency events, the environment, or the organisation
of public or private activities. The availability of mobile, networked information
communication technologies in the hands of ordinary people makes information
exchange increasingly potent and pervasive. The expected development for the near
future is the evolution of Web-based services supporting grassroots participation
by users, customers, and citizens in information sharing in a number of fields, from
eCommerce (most of which are already ongoing, see O’Reilly, 2005), to emergency
response (Palen, Hiltz, & Liu, 2007) and consumer collective applications such as
realtravel.com (Conrady, 2007).

Existing Web technologies for multimedia annotation and sharing (Flickr,¹
Youtube²), content creation (Wikipedia³), mass question answering (Yahoo!⁴/
Lycos⁵) or social networking (Facebook⁶) provide exciting new opportunities to
create innovative services. However, such approaches have reached a number of
important limits in their evolution:

1. Inability to “understand” and, more specifically, inability to manipulate the content
automatically, leading to failure in making information available for further
processing and, therefore, failure to exploit the emergence of trends at the social
and mass level, and the emergence of knowledge about a situation.

2. Limited access of such technology to mobile users and to organisations.

Also, the digital content rapidly reaches a mass that makes relevant information 
extremely complex and costly to handle. Yet, current applications do not fully sup- 
port intelligent processing and management of such information. Thus, users fail to 
access it efficiently and cannot exploit the underlying knowledge. 

In this chapter, novel techniques for exploiting multiple layers of intelligence 
from user-contributed content are presented. These layers, namely the Personal, 
Media, Mass, Social and Organizational Intelligence, constitute together the Collec- 
tive Intelligence. They are a form of intelligence that emerges from the collaboration 
and competition among many individuals and that seemingly has a mind of its 
own. The decomposition of collective intelligence into five layers is a methodolog- 
ical approach for research and development that separates different concerns into 
orthogonal layers. 

Collective Intelligence is extracted by understanding mass user-generated content,
with emphasis on integration and bridging (e.g., of the social and content dimensions)
and on the mobile and organizational/business aspects. This can be depicted by the
following formula:

Impact(Collective Intelligence) > Impact(Layer_i), i ∈ I,

where I is the set of layers of intelligence.

Also, collective intelligence benefits: 

1. End users who will be able to receive personalised information based on Collective
Intelligence in a largely automatic way.

2. Communities which will benefit from the generation of ad hoc services for their 
members and from improved community management. 



¹ http://www.flickr.com.

² http://www.youtube.com.

³ http://www.wikipedia.com.

⁴ http://answers.yahoo.com.

⁵ http://iq.lycos.co.uk.

⁶ http://www.facebook.com.






3. Organisations which will be able to manage different levels of intelligence 
to generate knowledge which will form the base of superior decision-support 
services. 

4. Service providers, seeking new outlets for customer and market development, 
beyond provision of products focused on individuals. 

Automating the acquisition of such knowledge from Collective Intelligence is
a major departure from traditional methods for information sharing, since managing
Collective Intelligence poses new requirements: for example, semantic analysis has
to fuse information coming both from the content itself and from the social context,
and the social dynamics that emerge have to be taken into account.

In the context of this work, we shall present two case studies. Initially, an Emer- 
gency Response case study will be tackled, where users provide intelligence about 
large scale emergencies, empowering a more effective and informed emergency 
action and at the same time receive information on how to act. A Consumers 
Social Group case study will follow, providing enhanced publishing tools to support 
group activities (e.g., organization of team events) and the ability to extract meta- 
information from content sources and group discussions. Both Use Cases denote 
the important effect of Collective Intelligence as well as its leverage for private, 
commercial and public purposes. 

In Sect. 2 the five layers that constitute the Collective Intelligence are described.
The Emergency Response and Consumers Social Group scenarios are
presented in Sect. 3 and the conclusions in Sect. 4.



2 Collective Intelligence 

The five layers that constitute the Collective Intelligence will be described in the 
following subsections. 



2.1 Personal Intelligence 

The Personal Intelligence layer deals with enabling users to both upload and 
access multimedia information submitted to the intelligent services using a range 
of devices, from mobile phones to PDAs and personal computers. Once multimedia 
content is submitted to the intelligent services, a series of processing and analysis 
procedures take place in order to exploit, share and reuse the extracted knowledge. 
User and context modelling paradigms will be employed to enable personalised 
access to the content and knowledge available from the proposed applications. 

Applications and services based on collective intelligence are expected to
reach their peak of usage by 2015. Even though by that time many decisive
factors, such as consumer trends in technology use (e.g., internet applications
and live services) or the capabilities of future networks and terminals, are expected
to have changed, the user-centric interaction model that is expected to take place
will remain the same.

Fig. 1 High-level representation of the personal intelligence model

This model will consist of a large number of “entities” like the actual events, the 
mechanisms of capturing event information (e.g., image and video cameras, geolo- 
cation possibilities, automatic recognition of voice annotations), the content types, 
the types of users, the devices, the network technologies, the network operators, 
the service providers, organisations, regulatory bodies, etc. All these constitute the 
“degrees of freedom” that dictate the quality of services offered to the end users. In 
Fig. 1 we present a rough representation of this model. It is effectively a user-centric 
interaction model or else, an end-to-end two-way information flow model. 

The rendition of an event by the end users, according to their perception and 
recording of it, is collected and sent to the system, so as to be processed by the 
intelligent services. However, the uplink flow of this personal intelligence confronts 
limitations imposed by natural attributes of the elements through this flow. Specifi- 
cally, due to the different types of users, the diverse mechanisms for content capture 
or the varying capabilities of devices and the access possibilities of the available
networks, the actual information submitted to the system's algorithms may be degraded
relative to what is expected or intended. This scaling down of information as perceived
by the users present at the time of an event, versus the quality of information
that finally reaches the system, is visually portrayed in Fig. 2. This has implications
in the quality of offered services, but to a great extent, is subject to the technological 
advances in the mobile devices and the wireless networks. 



2.2 Media Intelligence 

The first and main step towards efficient “Media Intelligence” deals with automated 
analysis and semantics extraction from raw visual, textual or audio content and 
associated metadata. Analysis focuses on each modality in isolation and without
taking into account any contextual information or the social environment. However, 
it does take into account prior knowledge, either implicit, in the form of supervised 
learning from training data, or explicit, in the form of knowledge driven approaches. 

Extracting knowledge from raw data is a huge research problem in its
own right, so work in this field is expected to advance the current state-of-the-art
techniques for each modality, while significant effort will be devoted to:

1. Adapting to the individual domains of interest and intelligence methodologies

2. Handling heterogeneity of unstructured user-contributed resources 

3. Supporting interoperability with contextual information 

Three main processes are proposed: text analysis, visual information analysis and
speech analysis.

Textual information is of fundamental importance in every scenario where humans
are involved, since texts are used to pass information explicitly to other people.
Textual information is pervasive and, with the advent of the Web, its availability
is increasing. Intelligent techniques are
required to enable automatic Information Extraction (IE) from text and make this
information available for further processing.

Visual information, that is, still images and especially video, tends to impose 
huge requirements on current repositories or social network sites in terms of stor- 
age or transmission due to the size of the data involved, yet its contribution to the 
knowledge and intelligence of related applications remains insignificant. Research 
in disciplines like image processing, pattern recognition and computer vision has 
been ongoing for decades but satisfactory performance can usually only be achieved 
in constrained domains, scales and environments. 

Speech is a natural, pervasive and efficient means for communication among 
people. Therefore, it is the privileged modality in many situations where safety and 
convenience issues require hands- and eyes-free interaction with computers or call
for direct access to information (no menu navigation, no typing). Its ubiquitous
and easy-to-use character also makes speech the primary communication channel in
emergency scenarios.

In the context of media intelligence, we focus on realistic conditions
that are compatible with the concept of Collective Intelligence. For example, when
extracting information from a set of recorded phone calls in the emergency scenario,
one can hardly expect a noise-free environment. The speech analysis will
combine the standard large-vocabulary continuous speech transcription methods 
with advanced keyword-detection techniques based on phonetic search. In this way, 
the speech analysis part of the media intelligence services will be able to identify 
semantically relevant content (e.g., the names of people and places) that would not 
be accessible in traditional systems. 



2.3 Mass Intelligence

Masses of users contribute their knowledge to communities in the Web 2.0. They
organize and share media such as images on Flickr, videos on YouTube, bookmarks
on Delicious,⁷ personal opinions, and others. Within such systems, the users can
provide feedback by evaluating the content provided and conducting assessments.
This can be done, e.g., by participating in discussions and answering questions in a
community portal.

Thus, Mass Intelligence combines the information from mass user feedback in 
order to extract patterns and trends that cannot be extracted by single content items. 
Facts and trends will be recognised and modelled by interpreting user feedback 
on a large scale. The key research challenge of Mass Intelligence is the question 
whether this mass of users can give new insights that would not be possible by 
considering the individual. To this end, we are currently analyzing the Lycos iQ data 
set. Users can ask arbitrary questions on the Lycos iQ platform. The community
answers questions in a discussion-forum style. The answers are assessed by the
community and credit points are awarded to the contributors. The Lycos iQ dataset
analyzed for Mass Intelligence contains more than 900,000 questions in German
and 64,000 questions in English. For analyzing the dataset, four aspects
are considered: 

• Can mass question answering improve the quality of search results? 

• Can implicit or explicit user feedback result in better ratings and rankings of 
questions? 

• What semantics emerge from the collaborative organization of media and knowledge
by classification and clustering?

• How does the mass data categorization and mass behavior change over time and 
how do opinions evolve? 



2.4 Social Intelligence 

Social Intelligence results from the monitoring, analysis, recognition, and under- 
standing of the needs and capabilities of individuals and communities from their 
information usage and communication interaction patterns. Social intelligence delivers
social information which may be used to improve other processes. At its
simplest, Social Intelligence can be seen as a social markup process on actors and
communication acts which provides social information as part of the pragmatic
dimension of communication. For instance, consider the recognition of “hubs” in
emergency situations such as during a hurricane. Providing these well-connected
individuals with critical information will reach a broad set of people rapidly with
minimal communication requirements, because the hubs are the individuals that
spread messages most effectively. Or, consider the identification of authorities. In
media intelligence or mass intelligence, content of these users should receive more
emphasis and attention.

⁷ http://www.delicious.com.

Watzlawick's communication model (Watzlawick, Beavin, & Jackson, 1974)
serves as the base for deriving the social intelligence layer. Social intelligence 
consists of three interconnected layers, namely the content layer of communi- 
cation messages, the meta-layer of communication messages, and the structural 
information layer derived from social interaction which represents the state of the 
communication process in a community. 

Social network analysis (Wasserman & Faust, 1999) and, for directed social
interactions, a Hermitian eigensystem analysis (Hoser & Geyer-Schulz, 2005) serve as
methods for analyzing the interaction structure of social networks. For the semantic
interpretation, social concepts from general sociology are used to interpret and
assign meaning to the results of such an analysis. Modal extensions to the knowledge
representation layer should be considered (see, e.g., Blackburn, 2002 and Fitting &
Mendelsohn, 1999). For example, when doing an eigensystem analysis of the link
structure of the WWW, hubs and authorities are identified as concepts: in that
context, hubs are web-sites that link to many relevant web-sites and thus
act as multipliers, whereas authorities are the relevant web-sites with the important
content. However, when analyzing the link structure in a social network site, the
social concept changes: depending on the culture of the network, the relevant social
concept may be that of friends, colleagues, or acquaintances. Furthermore, for the
qualification of such interactions, the social position, role, and rank of a person in his
or her community are of high importance, e.g., for marketing purposes. Visualized social
structures may also serve as innovative interfaces to Internet communication services
which ease the process of communicating with members of this social group.
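
The notion of hubs and authorities in a link structure can be illustrated with a small sketch. The code below is not the Hermitian eigensystem analysis cited above; it is the standard mutual-reinforcement power iteration for hub and authority scores (commonly known as HITS), applied to a hypothetical toy graph. The function name `hits` and the edge list are assumptions made for illustration only.

```python
def hits(edges, iterations=50):
    """Power iteration for hub and authority scores on a directed graph.

    edges: list of (source, target) pairs (hypothetical toy input).
    Returns two dicts (hubs, authorities), each normalised to sum to 1.
    """
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A node is a good authority if good hubs link to it.
        auths = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auths[dst] += hubs[src]
        # A node is a good hub if it links to good authorities.
        new_hubs = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hubs[src] += auths[dst]
        # Normalise so the scores stay bounded across iterations.
        auth_total = sum(auths.values()) or 1.0
        hub_total = sum(new_hubs.values()) or 1.0
        auths = {node: value / auth_total for node, value in auths.items()}
        hubs = {node: value / hub_total for node, value in new_hubs.items()}
    return hubs, auths


# Toy graph: "a" links to three pages, while "b" and "c" link only to "x".
edges = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "x"), ("c", "x")]
hubs, auths = hits(edges)
top_hub = max(hubs, key=hubs.get)
top_auth = max(auths, key=auths.get)
```

In this toy graph the node with the most outgoing links to well-linked pages ends up as the strongest hub, and the most linked-to node as the strongest authority. How these roles are interpreted in a social network depends on the social concept discussed above.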

Analysis at the meta-communication layer needs a strong link to media
intelligence: for example, digital audio streams coming to an emergency call-center
may be analyzed for emotions. Recognizing emotions may help in evaluating the
urgency of the situation. In another setting, pictures of holiday resorts may be
classified according to their emotional appeal. The exact wording of messages con- 
tains hints on the social background of the sender, so does the pronunciation of 
speech. 



2.5 Organizational Intelligence

In contrast to Personal Intelligence, the Organizational Intelligence deals with the 
sharing of knowledge between the individual members of an organization. As a 
consequence, the role of Organizational Intelligence is to bring the right piece of 
knowledge at the right time to the right person of the organization in order to support 
decision making. This knowledge is not necessarily only produced by individuals, 
but rather by the interaction with Personal, Media, Mass, and Social Intelligence. 
The persons addressed with Organizational Intelligence can be either within the
organization or external but involved in organizational processes. 

Professional organizations such as enterprises and governmental agencies have 
strong and often legally enforceable rules and boundaries. The association of per- 
sons to the organization or parts of the organization is typically clearly defined such 
as membership in the R&D department, human resources, etc. In addition, the role 
within that organization is typically known and well defined like head of group or 
silver command in emergency response. In contrast, non-professional organizations 
are only loosely coupled. The association of persons with a non-professional
organization can be fuzzy, such as being a member of a neighborhood community or a group
of friends. In addition, the roles may not be clearly defined in non-professional
organizations. For example, for a group of friends planning a weekend
trip to a foreign city, it is typically not clearly defined who takes the role of
organizer, leader, etc., in the group.

The goal of Organizational Intelligence is to best support professional organi- 
zations as well as non-professional organizations in carrying out their tasks and 
achieving their goals. To this end, we analyzed knowledge management processes 
in professional organizations and Web 2.0 communities (Scherp, Schwagereit, & 
Ireson, 2009). We identified their relations and propose professional organizations 
that are enhanced with Web 2.0 communities (Scherp, Schwagereit, Ireson, & Lan- 
franchi, 2009) and allow for exploiting the information from those communities for 
organizational purposes. This combination opens up new opportunities for
collaboration between the users in Web 2.0 communities and the entities
in professional organizations. However, potential hazards and new challenges
also arise in terms of privacy, trust, and reputation.



3 Use Cases 

The proposed system will demonstrate the wide applicability of its technologies 
through the design, implementation and evaluation of two heterogeneous case stud- 
ies: an Emergency Response and a Consumers Social Group case study. These 
studies are complementary in terms of intended users, potential business model and 
social impact. Nevertheless, both of them will naturally build on top of the deployed 
technologies, and a common architecture. 



3.1 Emergency Response Case Study 

The Emergency Response case study aims to develop technologies and interaction
modalities to better support professional users (i.e., Emergency Response personnel)
and citizens involved in an emergency, by providing means to intelligently gather
and reuse available knowledge, thus empowering a more effective and informed
emergency action. With the increased use of mobile devices and digital cameras,
people have become accustomed to capturing events and sharing the information (e.g.,
the BBC news website allows the public to contribute to the news by uploading pictures
or comments). We will design, implement and deliver technologies and methodologies
enabling citizens distributed across the region to participate in the monitoring of
an incident or event. This will benefit Emergency Response planners, who will have
real-time information available on which they can base their decisions and strategies,
enabling them to better react to an emergency. Moreover, the system will automatically
gather information available elsewhere on the network to aid the Emergency
Response, thus making it possible for an emergency planner to find exactly the
needed knowledge amongst all the available information and to selectively make
this knowledge available to the citizens (e.g., information about open roads,
information about relatives involved) in a largely automated way. The technologies in use
will therefore also encourage and enable dialogue between the Emergency Responders
and individuals, groups and communities. The case study will be based on the
research results of all Intelligence Layers, with Media, Mass and Social Intelligence
analysing the user-submitted content, Personal Intelligence allowing easier
upload and distribution of content to the end users, and Organisational Intelligence
providing all the extracted knowledge to the Emergency Responders.



3.2 Consumers Social Group Case Study 

A case study of the consumer social group application will be the implementation 
of a travel planner and guide. This application is intended as a travel adviser during 
the whole lifecycle of a trip: from planning until the end of it. This web-based ser- 
vice will draw travel-related information from multiple sources, such as: history of 
previous trips (including information extracted from trip reports, like photographs, 
videos, or text and voice annotations), user feedback from blogs about visits of
other people, or information available on the internet. Analysis of this informa- 
tion at different levels of intelligence leads to collective intelligence that can assist 
the service users in deciding and scheduling their future trip. These results will be 
based on user-posed criteria, such as destination preferences, time-of-the-year sug- 
gestions, social group members with whom to travel, logistics and cost of travel 
and stay, special enquiries about events, attractions, museums, restaurants, shop- 
ping, nightlife, etc. Once the group agrees on their trip plan, the users can carry the 
details of it during their trip, in their portable devices (e.g., PDA, smart phones). 
At the time of their trip, the users can ask the system for recommendations about 
specific locations, events, etc., also with reference to the vicinity of their current 
location. Besides, they can use their mobile devices in order to capture instances of 
their trip (through photos, video footage, annotations, etc.). This content can either
be uploaded to the system's server at once, or be stored locally on the device, so as
to be synchronized with the server at a later time. Once the trip comes to an end, the
users can post a report about their trip experiences, for instance from the comfort
of their desk. This information will feed the system with new material to be processed
and analysed for future trip planning.



4 Conclusions 

In this paper, novel techniques for exploiting multiple layers of intelligence from
user-contributed content have been presented. Together, these layers constitute the
Collective Intelligence. Collective Intelligence adds value to the available
information, enabling the accomplishment of tasks that are not possible otherwise,
as well as time savings and enhanced efficiency in existing procedures and workflows.



Computation of the Molenaar Sijtsma Statistic



L. Andries van der Ark 



Abstract The Molenaar Sijtsma statistic is an estimate of the reliability of a test 
score. In some special cases, computation of the Molenaar Sijtsma statistic requires 
provisional measures. These provisional measures have not been fully described in 
the literature, and we show that they have not been implemented in the software. 
We describe the required provisional measures as to allow the computation of the 
Molenaar Sijtsma statistic for all data sets. 

Keywords Molenaar Sijtsma statistic • Psychological test construction • Reliability.



1 Introduction 

Psychological and educational tests are often used for the classification of respon- 
dents. For example, a clinical psychologist may decide that one patient needs special 
treatment and another patient does not, based on their scores on a psychological test; 
and the decision of an admission committee of a university may strongly depend on 
the student’s score on an educational test. A valid classification requires that the 
test scores are reliable, which can be investigated by a reliability statistic. Most 
well known reliability statistics (e.g., Cronbach’s alpha, Cronbach, 1951; lambda-2, 
Guttman, 1945; the greatest lower bounds, Jackson & Agunwamba, 1977) are lower 
bounds to the reliability. The Molenaar Sijtsma statistic (MS) (Molenaar & Sijtsma, 
1984, 1988; Sijtsma, 1988; Sijtsma & Molenaar, 1987) gives a direct estimate of the 
reliability of a test score. Simulation studies showed that MS was almost unbiased 
and had less bias and smaller variance than other reliability statistics (Sijtsma & 
Molenaar, 1987; Van der Ark & Van der Palm, 2007). Therefore, MS gives a more 
accurate estimate of the reliability than other well known reliability statistics. MS is 
implemented in the software package MSP5.0 (Molenaar & Sijtsma, 2000). 

In some special cases MS cannot be computed straightforwardly and provisional
measures are required (Molenaar & Sijtsma, 1988), but these have never been
discussed in detail and, as we show in Sect. 4, have not been implemented in the
software package MSP5.0. Therefore, the researcher is left in the dark about what to do
in these special cases. This paper discusses all details of the computation of MS,
so as to allow the computation of MS in all cases. For reasons of space we do not
discuss details of the rationale of MS and its background theory. We refer the
interested reader to Molenaar and Sijtsma (1984, 1988), Sijtsma (1988), and Sijtsma and
Molenaar (1987).

Assume that a test consists of $J$ items, indexed by $i$ and $j$. Each item has
$m+1$ ordered answer categories $0, \ldots, m$, indexed by $g$ and $h$. The item scores are
denoted by $X_1, \ldots, X_J$. Assume that $N$ respondents have responded to the $J$ items
and there are no missing values. For each respondent the test score $X = \sum_{i=1}^{J} X_i$ is
used for classification. In classical test theory the expected value of a respondent's
test score over independent replications is called the true score and is denoted by $T$
(Lord & Novick, 1968). $T$ is unobservable. Let $\sigma^2(X)$ and $\sigma^2(T)$ denote the
population variance of the test score and the true score, respectively. Under the assumptions
of classical test theory, the reliability of $X$ is defined as $\rho_{XX'} = \sigma^2(T)/\sigma^2(X)$
(Lord & Novick, 1968). Since $\sigma^2(T)$ is unobservable, the reliability cannot be computed
directly and must be estimated.

Let $\pi_{g(i)} = P(X_i \geq g)$ denote the marginal cumulative probability of obtaining
a score of at least $g$ on item $i$, and let $\pi_{g(i),h(j)} = P(X_i \geq g, X_j \geq h)$
denote the joint cumulative probability of obtaining a score of at least $g$ on item
$i$ and at least $h$ on item $j$. Molenaar and Sijtsma (1988) showed that
$\sigma^2(T) = \sum_{i=1}^{J} \sum_{g=1}^{m} \sum_{j=1}^{J} \sum_{h=1}^{m} [\pi_{g(i),h(j)} - \pi_{g(i)} \pi_{h(j)}]$
and, therefore, the reliability of $X$ can be expressed as

$$\rho_{XX'} = \frac{\sum_{i=1}^{J} \sum_{g=1}^{m} \sum_{j=1}^{J} \sum_{h=1}^{m} [\pi_{g(i),h(j)} - \pi_{g(i)} \pi_{h(j)}]}{\sigma^2(X)}. \qquad (1)$$

MS estimates the reliability of $X$ by plugging in estimates for each term in (1).
The following estimates are straightforward because they only depend on observable
item scores:

• The population variance of the test score, $\sigma^2(X)$, is estimated by the (biased)
sample variance $S^2(X)$.

• The marginal cumulative probabilities $\pi_{g(i)}$ and $\pi_{h(j)}$ are estimated by the
corresponding marginal cumulative proportions in the sample, denoted $P_{g(i)}$ and
$P_{h(j)}$, respectively.

• If $i \neq j$, the joint cumulative probabilities $\pi_{g(i),h(j)}$ are estimated by the
corresponding observable joint cumulative proportions in the sample, denoted
$P_{g(i),h(j)}$.
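
The observable proportions described above can be sketched in a few lines. This is a minimal illustration, not the MSP5.0 implementation: the function names, the zero-based item indices, and the tiny data set (five respondents, two items, categories 0 to 2) are all assumptions made for the example.

```python
def marginal_cumulative(scores, item, g):
    """Proportion of respondents scoring at least g on the given item."""
    return sum(1 for row in scores if row[item] >= g) / len(scores)


def joint_cumulative(scores, i, g, j, h):
    """Proportion scoring at least g on item i and at least h on item j.

    Only defined for two different items; for i == j the proportion is
    unobservable in a single test administration.
    """
    if i == j:
        raise ValueError("joint proportion is unobservable when i == j")
    return sum(1 for row in scores if row[i] >= g and row[j] >= h) / len(scores)


# Hypothetical item scores: one row per respondent, one column per item.
scores = [[0, 1], [1, 2], [2, 2], [1, 0], [2, 1]]

p_marginal = marginal_cumulative(scores, item=0, g=1)   # 4 of 5 respondents
p_joint = joint_cumulative(scores, i=0, g=1, j=1, h=2)  # 2 of 5 respondents
```

Raising an error for `i == j` mirrors the point made in the text: those cells cannot be read off the data and have to be estimated separately.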

If $i = j$, $\pi_{g(i),h(i)}$ is the joint probability of obtaining at least score $g$ and at least
score $h$ on item $i$ in two independent replications. Estimation is not straightforward
because the corresponding joint cumulative proportions in the sample are unobservable
in a single test administration. Two cases are distinguished. In Case I,
there are no marginal cumulative proportions with exactly the same values. Case
I, which requires no provisional measures, is discussed in Sect. 2. In Case II, one
or more marginal cumulative proportions have exactly the same value. Case II,
which requires provisional measures, is discussed in Sect. 3. In Sect. 4, we show
that MSP5.0 can produce an incorrect MS.



2 Case I: The Computation of MS When No Provisional 
Measures Are Needed 

Case I is explained using the first numerical example, which consists of four items,
each with three ordered categories. Table 1 shows the marginal cumulative proportions.
Marginal cumulative proportions $P_{0(i)}$ ($i = 1, \ldots, J$) equal 1 by definition
and are not informative.

The first step in estimating the unobservable joint cumulative probabilities
$\pi_{g(i),h(i)}$ is to rank all the informative marginal cumulative proportions from small
to large. For the first numerical example, Table 1 shows that this rank order is

$$P_{2(4)} < P_{2(3)} < P_{2(2)} < P_{2(1)} < P_{1(4)} < P_{1(3)} < P_{1(2)} < P_{1(1)}. \qquad (2)$$
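
The ranking step can be sketched as follows, using the informative marginal cumulative proportions of Table 1. The dictionary layout, with keys `(g, i)` for category `g` of item `i`, is an assumption made for this illustration. The tie check shows how Case I could be told apart from Case II.

```python
# Informative marginal cumulative proportions from Table 1, keyed by (g, i).
marginals = {
    (1, 1): 0.90, (1, 2): 0.80, (1, 3): 0.70, (1, 4): 0.60,
    (2, 1): 0.50, (2, 2): 0.40, (2, 3): 0.30, (2, 4): 0.20,
}

# Step 1: rank the proportions from small to large.
order = sorted(marginals, key=marginals.get)

# Case I applies when no two marginal proportions have exactly the same value.
has_ties = len(set(marginals.values())) < len(marginals)
```

For the first numerical example, `order` starts with `(2, 4)` and ends with `(1, 1)`, matching the rank order in (2), and `has_ties` is false, so no provisional measures are needed.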

The second step in estimating the joint cumulative probabilities $\pi_{g(i),h(i)}$ is to
create a matrix of joint cumulative proportions in which the rows and columns are
ordered by the size of the corresponding marginal cumulative proportions [cf. (2) in
the first step]. Table 2 shows this matrix of joint cumulative proportions for the first
numerical example. NA indicates that a joint cumulative proportion is unobservable
and must be estimated. Assume that joint cumulative proportion $P_{g(i),h(i)}$ is in the cell
with row $r$ and column $c$. For convenience, $P_{g(i),h(i)}$ is denoted $P_{r,c}$ and the
corresponding marginal cumulative proportions $P_r$ and $P_c$, respectively. For example,
$P_{2(4),1(4)}$ is in row 1 and column 5 of Table 2 and is, therefore, denoted $P_{1,5}$.

The third step in estimating the unobservable joint cumulative probability is to
define:

1. The lower neighboring joint cumulative proportion: $P_{lo} = P_{r+1,c}$.

2. The right-hand neighboring joint cumulative proportion: $P_{ri} = P_{r,c+1}$.





Table 1 Marginal cumulative proportions of the first numerical example

              i = 1   i = 2   i = 3   i = 4
$P_{0(i)}$    1.00    1.00    1.00    1.00
$P_{1(i)}$    0.90    0.80    0.70    0.60
$P_{2(i)}$    0.50    0.40    0.30    0.20

Table 2 Marginal cumulative proportions (shown next to the row and column labels) and joint
cumulative proportions of the first numerical example

                      $P_{2(4)}$  $P_{2(3)}$  $P_{2(2)}$  $P_{2(1)}$  $P_{1(4)}$  $P_{1(3)}$  $P_{1(2)}$  $P_{1(1)}$
                      0.20        0.30        0.40        0.50        0.60        0.70        0.80        0.90
$P_{2(4)}$   0.20     NA          0.20        0.20        0.20        NA          0.20        0.20        0.20
$P_{2(3)}$   0.30     0.20        NA          0.30        0.30        0.30        NA          0.30        0.30
$P_{2(2)}$   0.40     0.20        0.30        NA          0.40        0.40        0.40        NA          0.40
$P_{2(1)}$   0.50     0.20        0.30        0.40        NA          0.50        0.50        0.50        NA
$P_{1(4)}$   0.60     NA          0.30        0.40        0.50        NA          0.60        0.60        0.60
$P_{1(3)}$   0.70     0.20        NA          0.40        0.50        0.60        NA          0.70        0.70
$P_{1(2)}$   0.80     0.20        0.30        NA          0.50        0.60        0.70        NA          0.80
$P_{1(1)}$   0.90     0.20        0.30        0.40        NA          0.60        0.70        0.80        NA


3. The upper neighboring joint cumulative proportion: $P_{up} = P_{r-1,c}$.

4. The left-hand neighboring joint cumulative proportion: $P_{le} = P_{r,c-1}$.

It may be noted that not all four neighboring joint cumulative proportions need exist.
For example, for $P_{1,5}$, $P_{up}$ does not exist, $P_{lo} = 0.30$, $P_{le} = 0.20$, and $P_{ri} = 0.20$.
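
The neighbor lookup of the third step can be sketched on the matrix of Table 2. The helper name `neighbours` and the use of `None` for missing cells are assumptions of this sketch. Row and column numbers are 1-based, as in the text. A neighbor that is itself unobservable (NA) is also returned as `None`; this is a simplification of the sketch rather than something the text specifies.

```python
NA = None  # unobservable cell: the same item occurs in the row and the column

# Joint cumulative proportions of Table 2; rows and columns follow rank order (2).
TABLE2 = [
    [NA,   0.20, 0.20, 0.20, NA,   0.20, 0.20, 0.20],
    [0.20, NA,   0.30, 0.30, 0.30, NA,   0.30, 0.30],
    [0.20, 0.30, NA,   0.40, 0.40, 0.40, NA,   0.40],
    [0.20, 0.30, 0.40, NA,   0.50, 0.50, 0.50, NA],
    [NA,   0.30, 0.40, 0.50, NA,   0.60, 0.60, 0.60],
    [0.20, NA,   0.40, 0.50, 0.60, NA,   0.70, 0.70],
    [0.20, 0.30, NA,   0.50, 0.60, 0.70, NA,   0.80],
    [0.20, 0.30, 0.40, NA,   0.60, 0.70, 0.80, NA],
]


def neighbours(matrix, r, c):
    """Return the lower, right-hand, upper and left-hand neighbours of cell (r, c)."""
    size = len(matrix)

    def cell(row, col):
        # Cells outside the matrix do not exist.
        if 1 <= row <= size and 1 <= col <= size:
            return matrix[row - 1][col - 1]
        return None

    return {
        "lo": cell(r + 1, c),
        "ri": cell(r, c + 1),
        "up": cell(r - 1, c),
        "le": cell(r, c - 1),
    }


nb = neighbours(TABLE2, 1, 5)
```

For cell (1, 5) this reproduces the values given in the text: no upper neighbor, a lower neighbor of 0.30, and left-hand and right-hand neighbors of 0.20.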

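The fourth step is to estimate the unobservable joint cumulative probability
$\pi_{g(i),h(i)}$ eight times using the following eight different estimates (see Molenaar
& Sijtsma, 1988, for the derivation).

$$P^{(1)}_{r,c} = P_{lo} \frac{P_r}{P_{r+1}} \qquad (3)$$

$$P^{(2)}_{r,c} = P_{ri} \frac{P_c}{P_{c+1}} \qquad (4)$$

$$P^{(3)}_{r,c} = P_{up} \frac{P_r}{P_{r-1}} \qquad (5)$$

$$P^{(4)}_{r,c} = P_{le} \frac{P_c}{P_{c-1}} \qquad (6)$$

$$P^{(5)}_{r,c} = P_{lo} \frac{1 - P_r}{1 - P_{r+1}} - P_c \frac{P_{r+1} - P_r}{1 - P_{r+1}} \qquad (7)$$

$$P^{(6)}_{r,c} = P_{ri} \frac{1 - P_c}{1 - P_{c+1}} - P_r \frac{P_{c+1} - P_c}{1 - P_{c+1}} \qquad (8)$$

$$P^{(7)}_{r,c} = P_{up} \frac{1 - P_r}{1 - P_{r-1}} + P_c \frac{P_r - P_{r-1}}{1 - P_{r-1}} \qquad (9)$$

$$P^{(8)}_{r,c} = P_{le} \frac{1 - P_c}{1 - P_{c-1}} + P_r \frac{P_c - P_{c-1}}{1 - P_{c-1}} \qquad (10)$$
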



Joint cumulative probability $\pi_{g(i),h(i)}$ is then estimated by $P_{r,c}$, the mean of all
existing estimates in (3)-(10). For the first numerical example, it may be noted that

\begin{align*}
P^{(1)}_{1,5} &= 0.3 \times \frac{0.2}{0.3} = 0.2,\\
P^{(2)}_{1,5} &= 0.2 \times \frac{0.6}{0.7} = 0.1714,\\
P^{(3)}_{1,5} &\ \text{does not exist},\\
P^{(4)}_{1,5} &= 0.2 \times \frac{0.6}{0.5} = 0.24,\\
P^{(5)}_{1,5} &= 0.3 \times \frac{1-0.2}{1-0.3} - 0.6 \times \frac{0.3-0.2}{1-0.3} = 0.2571,\\
P^{(6)}_{1,5} &= 0.2 \times \frac{1-0.6}{1-0.7} - 0.2 \times \frac{0.7-0.6}{1-0.7} = 0.2,\\
P^{(7)}_{1,5} &\ \text{does not exist},\\
P^{(8)}_{1,5} &= 0.2 \times \frac{1-0.6}{1-0.5} + 0.2 \times \frac{0.6-0.5}{1-0.5} = 0.2.
\end{align*}

As a result

$$P_{1,5} = \frac{0.2 + 0.1714 + 0.24 + 0.2571 + 0.2 + 0.2}{6} = 0.2114.$$
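The arithmetic of this step can be retraced with a short script. The following is a minimal sketch, not code taken from the paper or from any software package; the function name `existing_estimates` and its argument names are hypothetical and chosen only for this illustration. A neighboring proportion that does not exist is passed as `None`, in which case the two estimates that depend on it are skipped.

```python
from statistics import mean


def existing_estimates(pr, pc, pr_next=None, pc_next=None, pr_prev=None,
                       pc_prev=None, p_lo=None, p_ri=None, p_up=None, p_le=None):
    """Return the existing estimates among (3)-(10), keyed 1..8.

    pr, pc           : marginal cumulative proportions of row r and column c
    pr_next, pr_prev : marginals of rows r+1 and r-1 (None if absent)
    pc_next, pc_prev : marginals of columns c+1 and c-1 (None if absent)
    p_lo, p_ri, p_up, p_le : neighboring joint cumulative proportions
    """
    est = {}
    if p_lo is not None:  # estimates (3) and (7)
        est[1] = p_lo * pr / pr_next
        est[5] = (p_lo * (1 - pr) / (1 - pr_next)
                  - pc * (pr_next - pr) / (1 - pr_next))
    if p_ri is not None:  # estimates (4) and (8)
        est[2] = p_ri * pc / pc_next
        est[6] = (p_ri * (1 - pc) / (1 - pc_next)
                  - pr * (pc_next - pc) / (1 - pc_next))
    if p_up is not None:  # estimates (5) and (9)
        est[3] = p_up * pr / pr_prev
        est[7] = (p_up * (1 - pr) / (1 - pr_prev)
                  + pc * (pr - pr_prev) / (1 - pr_prev))
    if p_le is not None:  # estimates (6) and (10)
        est[4] = p_le * pc / pc_prev
        est[8] = (p_le * (1 - pc) / (1 - pc_prev)
                  + pr * (pc - pc_prev) / (1 - pc_prev))
    return est


# Cell (1, 5) of Table 2: row marginal 0.2, column marginal 0.6,
# no upper neighbor because the cell lies in the first row.
est = existing_estimates(pr=0.2, pc=0.6, pr_next=0.3, pc_next=0.7, pc_prev=0.5,
                         p_lo=0.30, p_ri=0.20, p_le=0.20)
p_mean = mean(est.values())

for k in sorted(est):
    print(f"P({k}) = {est[k]:.4f}")
print(f"mean of existing estimates = {p_mean:.4f}")
```

Only six of the eight estimates are returned for this cell, and their mean agrees with the value 0.2114 obtained by hand.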



It was noted by Molenaar and Sijtsma (1988) that $P_{r,c}$ should lie in the interval

$$P_r P_c \le P_{r,c} \le \min(P_r, P_c).$$

For $P_{1,5}$, the lower bound equals $0.2 \times 0.6 = 0.12$ and the upper bound equals
$\min(0.2, 0.6) = 0.2$. Because the mean estimate 0.2114 exceeds the upper bound, it is
replaced by that bound. Hence, the final estimate for $\pi_{1(4),2(4)}$ is 0.2. Table 3 shows
the joint cumulative proportions of the first numerical example, with all estimated
unobservable joint cumulative proportions underlined. The joint cumulative proportions
in Table 3 are plugged into (1). Suppose that $S^2(X) = 9$. Using the values in
Table 3 it may then be verified that



Table 3 Marginal cumulative proportions (boldface) and joint cumulative proportions of the
first numerical example. Estimated unobservable joint cumulative probabilities (accuracy in three
digits) are underlined

               P2(4)    P2(3)    P2(2)    P2(1)    P1(4)    P1(3)    P1(2)    P1(1)
               0.20     0.30     0.40     0.50     0.60     0.70     0.80     0.90
P2(4)  0.20   _0.167_   0.20     0.20     0.20    _0.200_   0.20     0.20     0.20
P2(3)  0.30    0.20    _0.259_   0.30     0.30     0.30    _0.300_   0.30     0.30
P2(2)  0.40    0.20     0.30    _0.359_   0.40     0.40     0.40    _0.400_   0.40
P2(1)  0.50    0.20     0.30     0.40    _0.458_   0.50     0.50     0.50    _0.500_
P1(4)  0.60   _0.200_   0.30     0.40     0.50    _0.559_   0.60     0.60     0.60
P1(3)  0.70    0.20    _0.300_   0.40     0.50     0.60    _0.659_   0.70     0.70
P1(2)  0.80    0.20     0.30    _0.400_   0.50     0.60     0.70    _0.761_   0.80
P1(1)  0.90    0.20     0.30     0.40    _0.500_   0.60     0.70     0.80    _0.875_
$$MS = \sum_{i=1}^{J}\sum_{g=1}^{m}\sum_{j=1}^{J}\sum_{h=1}^{m}
\frac{\pi_{g(i),h(j)} - P_{g(i)} P_{h(j)}}{S^2(X)} = \frac{7.137}{9} = 0.793.$$
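The value 0.793 can be checked directly from Table 3. The sketch below is an illustration only: the variable names are hypothetical, the matrix is typed in from Table 3 with the estimated cells at three decimals, and $S^2(X) = 9$ is the value supposed in the text. The script also confirms that every entry respects the interval $P_r P_c \le P_{r,c} \le \min(P_r, P_c)$.

```python
# Marginal cumulative proportions in ascending order (header of Table 3).
p = [0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

# Joint cumulative proportions of Table 3; three-decimal entries are the
# estimated unobservable proportions.
table3 = [
    [0.167, 0.20, 0.20, 0.20, 0.200, 0.20, 0.20, 0.20],
    [0.20, 0.259, 0.30, 0.30, 0.30, 0.300, 0.30, 0.30],
    [0.20, 0.30, 0.359, 0.40, 0.40, 0.40, 0.400, 0.40],
    [0.20, 0.30, 0.40, 0.458, 0.50, 0.50, 0.50, 0.500],
    [0.200, 0.30, 0.40, 0.50, 0.559, 0.60, 0.60, 0.60],
    [0.20, 0.300, 0.40, 0.50, 0.60, 0.659, 0.70, 0.70],
    [0.20, 0.30, 0.400, 0.50, 0.60, 0.70, 0.761, 0.80],
    [0.20, 0.30, 0.40, 0.500, 0.60, 0.70, 0.80, 0.875],
]
var_x = 9.0  # S^2(X) as supposed in the text

n = len(p)

# Every joint proportion should lie between P_r * P_c and min(P_r, P_c).
within_bounds = all(
    p[r] * p[c] - 1e-12 <= table3[r][c] <= min(p[r], p[c]) + 1e-12
    for r in range(n) for c in range(n)
)

# Numerator of MS: sum over all cells of (joint proportion - product of marginals).
numerator = sum(table3[r][c] - p[r] * p[c] for r in range(n) for c in range(n))
ms = numerator / var_x

print(f"within bounds: {within_bounds}")
print(f"numerator = {numerator:.3f}, MS = {ms:.3f}")
```

Summing over all 64 cells, rather than over the upper triangle only, matches the quadruple sum in the formula, in which every ordered pair of item steps appears once.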



3 Case II: The Computation of MS When Provisional 
Measures Are Needed 

The following citation taken from Molenaar and Sijtsma (1988) illustrates that 
Sect. 2 is not sufficient for computing MS in all cases: 

Furthermore, alternative approximations methods are used when $P_{g(i)}$ or $P_{h(j)}$ or both,
belong to a string of identical proportions. In such cases the choice of adjacent elements 
becomes problematic. Since the discussion of the solutions to this problem would take much 
space, we prefer to give only a brief outline 

A detailed discussion of the solutions is presented here. In Case II, $P_{g(i)}$ or $P_{h(j)}$
or both may belong to a string of identical proportions. Case II is explained using
the second numerical example, which consists of four items, each with three ordered
categories. The second numerical example contains two strings of identical marginal
cumulative proportions (Table 4).

As in Case I, the marginal cumulative proportions are put in an ascending
order. For the second numerical example the rank order of the cumulative marginal
proportions is

$$\{P_{2(4)}, P_{2(3)}\} < P_{2(2)} < P_{2(1)} < P_{1(4)} < \{P_{1(3)}, P_{1(2)}, P_{1(1)}\}.$$

There is no unique order of the cumulative marginal proportions and, therefore,
there is no unique order of the rows and columns of the matrix of joint cumulative
proportions (Table 5). The order of rows 1 and 2; the order of columns 1 and 2; the
order of rows 6, 7, and 8; and the order of columns 6, 7, and 8 are undetermined. As
a result, the neighboring joint cumulative proportions $P_{lo}$, $P_{le}$, $P_{ri}$, and $P_{up}$ cannot
be estimated unambiguously.
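The strings of identical proportions can be found mechanically by sorting the marginal cumulative proportions and grouping equal neighbors. The sketch below assumes the values of Table 4; the dictionary layout, the variable names, and the tie-breaking rule inside a string are hypothetical choices made for this illustration (any order within a string is equally valid, which is exactly the ambiguity described above).

```python
from itertools import groupby

# Table 4: marginal cumulative proportions, keyed by (item step g, item i).
marginals = {
    (1, 1): 0.60, (1, 2): 0.60, (1, 3): 0.60, (1, 4): 0.50,
    (2, 1): 0.40, (2, 2): 0.30, (2, 3): 0.20, (2, 4): 0.20,
}

# Ascending order. The tie-break within a string is arbitrary; here the higher
# step and higher item number come first, mirroring the order used in the text.
ordered = sorted(marginals, key=lambda k: (marginals[k], -k[0], -k[1]))

# Consecutive equal proportions form one group. Exact float equality is safe
# here only because the values are typed-in literals.
groups = [list(g) for _, g in groupby(ordered, key=marginals.get)]

# A row or column position is "arbitrary" when its group has several members.
arbitrary = [len(g) > 1 for g in groups for _ in g]

for g in groups:
    label = ", ".join(f"P{step}({item})" for step, item in g)
    print(f"{marginals[g[0]]:.2f}: {label}")
```

The two groups with more than one member are the two strings of identical proportions; their positions (1, 2 and 6, 7, 8) are the rows and columns whose order is undetermined.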

In general, four types of cells can be distinguished in the matrix of joint cumulative
probabilities:

1: A cell whose row and column are in an arbitrary order.

2: A cell whose column is in an arbitrary order and whose row is in a unique
order.

3: A cell whose row is in an arbitrary order and whose column is in a unique
order.

4: A cell whose row and column are in a unique order.

Table 4 Marginal cumulative proportions of the second numerical example

         i = 1   i = 2   i = 3   i = 4
P0(i)    1.00    1.00    1.00    1.00
P1(i)    0.60    0.60    0.60    0.50
P2(i)    0.40    0.30    0.20    0.20

Table 5 Marginal cumulative proportions (boldface) and joint cumulative proportions of the
second numerical example

              P2(4)  P2(3)  P2(2)  P2(1)  P1(4)  P1(3)  P1(2)  P1(1)
              0.20   0.20   0.30   0.40   0.50   0.60   0.60   0.60
P2(4)  0.20   NA     0.20   0.20   0.20   NA     0.20   0.20   0.20
P2(3)  0.20   0.20   NA     0.20   0.20   0.20   NA     0.20   0.20
P2(2)  0.30   0.20   0.20   NA     0.30   0.30   0.30   NA     0.30
P2(1)  0.40   0.20   0.20   0.30   NA     0.40   0.40   0.40   NA
P1(4)  0.50   NA     0.20   0.30   0.40   NA     0.50   0.50   0.50
P1(3)  0.60   0.20   NA     0.30   0.40   0.50   NA     0.60   0.60
P1(2)  0.60   0.20   0.20   NA     0.40   0.50   0.60   NA     0.60
P1(1)  0.60   0.20   0.20   0.30   NA     0.50   0.60   0.60   NA

Table 6 Types and sets of cells of Table 5. Cells pertaining to unobservable joint cumulative
proportions are underlined. Marginal cumulative proportions are in boldface

               P2(4)  P2(3) | P2(2) | P2(1) | P1(4) | P1(3)  P1(2)  P1(1)
               0.20   0.20  | 0.30  | 0.40  | 0.50  | 0.60   0.60   0.60
P2(4)  0.20    _1_     1    |   3   |   3   |  _3_  |   1      1      1
P2(3)  0.20     1     _1_   |   3   |   3   |   3   |  _1_     1      1
--------------------------------------------------------------------------
P2(2)  0.30     2      2    |  _4_  |   4   |   4   |   2     _2_     2
--------------------------------------------------------------------------
P2(1)  0.40     2      2    |   4   |  _4_  |   4   |   2      2     _2_
--------------------------------------------------------------------------
P1(4)  0.50    _2_     2    |   4   |   4   |  _4_  |   2      2      2
--------------------------------------------------------------------------
P1(3)  0.60     1     _1_   |   3   |   3   |   3   |  _1_     1      1
P1(2)  0.60     1      1    |  _3_  |   3   |   3   |   1     _1_     1
P1(1)  0.60     1      1    |   3   |  _3_  |   3   |   1      1     _1_

Table 6 shows the type of cell for each cumulative joint proportion in Table 5. If two
or more adjacent joint cumulative proportions have the same marginal cumulative
proportions, then we say that the corresponding cells in the matrix of joint cumulative
proportions belong to the same set. In Table 6, if two cells are not separated by
a line, then the cells belong to the same set. For example, $P_{1,1}$, $P_{1,2}$, $P_{2,1}$, and $P_{2,2}$
belong to the same set.

In the computation of neighboring joint cumulative proportions, sets rather than 
cells are considered. The neighboring joint cumulative proportions are defined 
differently for different types of cells. 

• If an unobserved cumulative joint probability is in a cell of Type 1, $P_{up}$, $P_{lo}$,
$P_{ri}$, and $P_{le}$ are undetermined and set equal to the mean of all observed joint
cumulative proportions in the set.

• If an unobserved cumulative joint probability is in a cell of Type 2, $P_{ri}$ and $P_{le}$
are set equal to the mean of all observed joint cumulative proportions in the set.
$P_{up}$ is set equal to the mean of all observed joint cumulative proportions in the set
above the cell, and $P_{lo}$ is set equal to the mean of all observed joint cumulative
proportions in the set below the cell. If a set does not exist, the corresponding
neighboring joint cumulative proportion does not exist.

• If an unobserved cumulative joint probability is in a cell of Type 3, $P_{up}$ and $P_{lo}$
are set equal to the mean of all observed joint cumulative proportions in the set.
$P_{ri}$ is set equal to the mean of all observed joint cumulative proportions in the set
right of the cell, and $P_{le}$ is set equal to the mean of all observed joint cumulative
proportions in the set left of the cell. If a set does not exist, the corresponding
neighboring joint cumulative proportion does not exist.

• If an unobserved cumulative joint probability is in a cell of Type 4, $P_{up}$ is set
equal to the mean of all observed joint cumulative proportions in the set above the
cell, $P_{lo}$ is set equal to the mean of all observed joint cumulative proportions in
the set below the cell, $P_{ri}$ is set equal to the mean of all observed joint cumulative
proportions in the set right of the cell, and $P_{le}$ is set equal to the mean of all observed
joint cumulative proportions in the set left of the cell. If a set does not exist, the
corresponding neighboring joint cumulative proportion does not exist.

Applying these rules to the second numerical example yields the following neigh- 
boring joint cumulative proportions: 

For $P_{2(4),2(4)}$, $P_{up} = P_{lo} = P_{ri} = P_{le} = 0.2$.

For $P_{1(4),2(4)}$, $P_{up} = P_{lo} = 0.2$, $P_{ri} = 0.2$, and $P_{le} = 0.2$.

For $P_{2(3),2(3)}$, $P_{up} = P_{lo} = P_{ri} = P_{le} = 0.2$.

For $P_{1(3),2(3)}$, $P_{up} = P_{lo} = P_{ri} = P_{le} = 0.2$.

For $P_{2(2),2(2)}$, $P_{up} = 0.2$, $P_{lo} = 0.3$, $P_{ri} = 0.3$, and $P_{le} = 0.2$.

For $P_{1(2),2(2)}$, $P_{up} = 0.2$, $P_{lo} = 0.4$, and $P_{ri} = P_{le} = 0.3$.

For $P_{2(1),2(1)}$, $P_{up} = 0.3$, $P_{lo} = 0.4$, $P_{ri} = 0.4$, and $P_{le} = 0.3$.

For $P_{1(1),2(1)}$, $P_{up} = 0.3$, $P_{lo} = 0.5$, and $P_{ri} = P_{le} = 0.4$.

For $P_{1(4),1(4)}$, $P_{up} = 0.4$, $P_{lo} = 0.5$, $P_{ri} = 0.5$, and $P_{le} = 0.4$.

For $P_{1(3),1(3)}$, $P_{up} = P_{lo} = P_{ri} = P_{le} = 0.6$.

For $P_{1(2),1(2)}$, $P_{up} = P_{lo} = P_{ri} = P_{le} = 0.6$.

For $P_{1(1),1(1)}$, $P_{up} = P_{lo} = P_{ri} = P_{le} = 0.6$.

Once the neighboring joint cumulative proportions have been computed, the unob- 
servable joint cumulative proportions can be estimated using (3) through (10) and 
the same procedure as in Case I (Sect. 2) can be used to compute MS. 
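The set-based rules for the neighboring proportions can be written compactly by noting that the vertical neighbors come from the cell's own set when its row belongs to a string of ties, and from the adjacent set otherwise; the same holds for the horizontal neighbors and the column. The sketch below is one possible reading of those rules, applied to Table 5. All function and variable names are hypothetical, indices are 0-based, and `None` marks both unobservable cells and neighbors that do not exist.

```python
from statistics import mean

NA = None  # marks an unobservable joint cumulative proportion

# Ascending marginal cumulative proportions and joint proportions of Table 5.
p = [0.20, 0.20, 0.30, 0.40, 0.50, 0.60, 0.60, 0.60]
P = [
    [NA,   0.20, 0.20, 0.20, NA,   0.20, 0.20, 0.20],
    [0.20, NA,   0.20, 0.20, 0.20, NA,   0.20, 0.20],
    [0.20, 0.20, NA,   0.30, 0.30, 0.30, NA,   0.30],
    [0.20, 0.20, 0.30, NA,   0.40, 0.40, 0.40, NA],
    [NA,   0.20, 0.30, 0.40, NA,   0.50, 0.50, 0.50],
    [0.20, NA,   0.30, 0.40, 0.50, NA,   0.60, 0.60],
    [0.20, 0.20, NA,   0.40, 0.50, 0.60, NA,   0.60],
    [0.20, 0.20, 0.30, NA,   0.50, 0.60, 0.60, NA],
]


def tie_blocks(values):
    """Half-open index ranges of runs of identical marginal proportions."""
    blocks, start = [], 0
    for k in range(1, len(values) + 1):
        if k == len(values) or values[k] != values[start]:
            blocks.append((start, k))
            start = k
    return blocks


blocks = tie_blocks(p)


def block_of(k):
    """Index of the block of tied marginals that contains position k."""
    return next(i for i, (a, b) in enumerate(blocks) if a <= k < b)


def is_tied(b):
    """True if block b holds more than one marginal (arbitrary order)."""
    return blocks[b][1] - blocks[b][0] > 1


def set_mean(rb, cb):
    """Mean of the observed proportions in the set (row block rb, column block cb)."""
    if not (0 <= rb < len(blocks) and 0 <= cb < len(blocks)):
        return None  # the set does not exist
    vals = [P[r][c]
            for r in range(*blocks[rb]) for c in range(*blocks[cb])
            if P[r][c] is not NA]
    return mean(vals) if vals else None


def cell_type(r, c):
    """Cell type 1-4 as defined in the text."""
    row_tied, col_tied = is_tied(block_of(r)), is_tied(block_of(c))
    if row_tied and col_tied:
        return 1
    if col_tied:
        return 2
    return 3 if row_tied else 4


def neighbors(r, c):
    """Return (P_up, P_lo, P_ri, P_le) for the cell in row r, column c."""
    rb, cb = block_of(r), block_of(c)
    own = set_mean(rb, cb)
    up = own if is_tied(rb) else set_mean(rb - 1, cb)
    lo = own if is_tied(rb) else set_mean(rb + 1, cb)
    ri = own if is_tied(cb) else set_mean(rb, cb + 1)
    le = own if is_tied(cb) else set_mean(rb, cb - 1)
    return up, lo, ri, le


# Print type and neighbors for every unobservable cell on or above the diagonal.
for r in range(len(p)):
    for c in range(r, len(p)):
        if P[r][c] is NA:
            print((r + 1, c + 1), "type", cell_type(r, c), neighbors(r, c))
```

For the cell in row 3 and column 7, which corresponds to $P_{1(2),2(2)}$, the script gives $P_{up} = 0.2$, $P_{lo} = 0.4$, and $P_{ri} = P_{le} = 0.3$, in agreement with the list above.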



4 Estimation of the Unobservable Joint Cumulative 
Probabilities in MSP5.0 

The third numerical example, containing five items with two ordered answer categories,
shows that the provisional measures described in this note are not applied in
MSP5.0 (Molenaar & Sijtsma, 2000). The matrix of joint cumulative proportions is
shown in Table 7.
Table 7 Marginal cumulative proportions (boldface) and joint cumulative proportions of the third
numerical example. The proportions rounded in two decimals were taken from MSP5.0 output.
Unobservable proportions are underlined

               P1(1)   P1(2)   P1(3)   P1(4)   P1(5)
               0.40    0.60    0.60    0.60    0.70
P1(1)  0.40   _0.33_   0.40    0.30    0.40    0.30
P1(2)  0.60    0.40   _0.40_   0.30    0.50    0.40
P1(3)  0.60    0.30    0.30   _0.36_   0.40    0.50
P1(4)  0.60    0.40    0.50    0.40   _0.45_   0.50
P1(5)  0.70    0.30    0.40    0.50    0.50   _0.57_

Applying the rules for computing the neighboring joint cumulative proportions
(Sect. 3) to $P_{1,1}$ yields the following results. $P_{up}$ and $P_{le}$ do not exist,

$$P_{lo} = \frac{0.4 + 0.4 + 0.3}{3} = 0.367, \quad\text{and}\quad P_{ri} = \frac{0.4 + 0.4 + 0.3}{3} = 0.367.$$

Hence,

\begin{align*}
P^{(1)}_{1,1} = P^{(2)}_{1,1} &= 0.367 \times \frac{0.4}{0.6} = 0.244,\\
P^{(5)}_{1,1} = P^{(6)}_{1,1} &= 0.367 \times \frac{1-0.4}{1-0.6} - 0.4 \times \frac{0.6-0.4}{1-0.6} = 0.35,
\end{align*}

and $P^{(3)}_{1,1}$, $P^{(4)}_{1,1}$, $P^{(7)}_{1,1}$, and $P^{(8)}_{1,1}$ do not exist. Thus, the correct
estimate of $\pi_{1(1),1(1)}$ is $P_{1,1} = (0.244 + 0.244 + 0.35 + 0.35)/4 = 0.297$. The incorrect
value, $P_{1,1} = 0.33$ reported by MSP5.0 (Table 7), is obtained when one ignores that
$P_{1(2)} = P_{1(3)} = P_{1(4)}$ and, in addition, when $P_{1(2)}$ is treated as the only neighboring
marginal cumulative proportion (cf. Case I, Sect. 2). This results in $P_{ri} = P_{lo} = 0.40$.
Applying (3) and (4) yields $P^{(1)}_{1,1} = P^{(2)}_{1,1} = 0.26667$, and applying (7) and (8) yields
$P^{(5)}_{1,1} = P^{(6)}_{1,1} = 0.4$. The average of these four estimates equals the value 0.33
produced by MSP5.0. It may be noted that this problem occurs for all unobservable joint
cumulative proportions in Table 7.
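The difference between the two values can be reproduced in a few lines. The sketch below is illustrative only and makes no claim about how MSP5.0 is implemented internally; it merely evaluates estimates (3), (4), (7), and (8) for $P_{1,1}$ twice, once with the set-based neighbor 0.367 and once with the single-cell neighbor 0.40. The function name `estimate_first_cell` and its arguments are hypothetical.

```python
def estimate_first_cell(p_neighbor, p_cur=0.4, p_next=0.6):
    """Mean of estimates (3), (4), (7), (8) for P_{1,1} of Table 7.

    p_neighbor is the common value of P_lo and P_ri; p_cur and p_next are the
    marginal cumulative proportions of the first and the next row (and column).
    Because the cell lies on the diagonal, (3) equals (4) and (7) equals (8).
    """
    ratio = p_neighbor * p_cur / p_next                        # (3) and (4)
    compl = (p_neighbor * (1 - p_cur) / (1 - p_next)
             - p_cur * (p_next - p_cur) / (1 - p_next))        # (7) and (8)
    return (2 * ratio + 2 * compl) / 4


# Set-based neighbor: mean of the observed proportions 0.4, 0.3, 0.4 in the
# set below (and, by symmetry, right of) the cell.
correct = estimate_first_cell((0.4 + 0.3 + 0.4) / 3)

# Single-cell neighbor: only the adjacent cell (value 0.40) is used.
msp5 = estimate_first_cell(0.40)

print(f"set-based neighbors:   {correct:.3f}")
print(f"single-cell neighbors: {msp5:.3f}")
```

The first value rounds to 0.297 and the second to 0.333, the latter matching the 0.33 shown in Table 7.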



5 Discussion 

The description of the provisional measures required for the computation of MS in
special cases fills a gap in the literature on this statistic. The provisional measures
are required (Molenaar & Sijtsma, 1984, 1988; Sijtsma & Molenaar, 1987) but were
not discussed in detail, and were not incorporated in the software. The explanations
in this paper make it possible to compute MS correctly for future applications.
For example, as of 2009, the function check.reliability in the R package
mokken (Van der Ark, 2007) computes MS correctly.

There are two reasons to assume that the effect of the flaw in the MSP5.0 software is
very small or negligible for practical situations. First, it only applies to the special
case where two or more marginal cumulative proportions are identical. If sample
sizes are sufficiently large, the probability that this happens is rather small. Second,
only one or a few of all the joint cumulative proportions needed for the computation
of MS are affected by this flaw.